Semiautomated relay method and apparatus

ABSTRACT

A captioning method for presenting captions to an assisted user (AU) during communication with a hearing user (HU) where the assisted user uses a captioned device and the hearing user uses a hearing user's device to facilitate the communication, the captioned device including a display screen and a speaker for presenting captions and broadcasting the hearing user's voice signals, respectively, the method comprising the steps of, during an ongoing call between the AU and the HU, using an automated speech recognition (ASR) engine to generate initial ASR captions associated with the HU's voice signal, assessing at least one caption quality factor associated with prior initial ASR captions generated during the ongoing call, delaying broadcast of the HU voice signal to the AU and, based on the at least one caption quality factor, adjusting a duration of the HU voice signal broadcast delay.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/422,662, filed on May 24, 2019, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS,” which is a continuation-in-part of U.S. patent application Ser. No. 15/982,239, filed on May 17, 2018, and which is titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS,” which is a continuation-in-part of U.S. patent application Ser. No. 15/729,069, filed on Oct. 10, 2017, and which is titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS,” which is a continuation-in-part of U.S. patent application Ser. No. 15/171,720, filed on Jun. 2, 2016, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS,” which is a continuation-in-part of U.S. patent application Ser. No. 14/953,631, filed on Nov. 30, 2015, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS,” which is a continuation-in-part of U.S. patent application Ser. No. 14/632,257, filed on Feb. 26, 2015, issued as U.S. Pat. No. 10,389,876 on Aug. 20, 2019, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS,” which claims priority to U.S. provisional patent application Ser. No. 61/946,072, filed on Feb. 28, 2014, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS,” and claims priority to each of the above applications, each of which is incorporated herein in its entirety by reference.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Not applicable.

BACKGROUND OF THE DISCLOSURE

The present invention relates to relay systems for providing voice-to-text captioning for hearing impaired users and more specifically to a relay system that uses automated voice-to-text captioning software to transcribe voice to text.

Many people have at least some degree of hearing loss. For instance, in the United States, about 3 out of every 1,000 people are functionally deaf and about 17 percent (36 million) of American adults report some degree of hearing loss, which typically gets worse as people age. Many people with hearing loss have developed ways to cope with the ways their loss affects their ability to communicate. For instance, many deaf people have learned to use their sight to compensate for hearing loss by either communicating via sign language or by reading another person's lips as they speak.

When it comes to remotely communicating using a telephone, unfortunately, there is no way for a hearing impaired person (e.g., an assisted user (AU)) to use sight to compensate for hearing loss, as conventional telephones do not enable an AU to see a person on the other end of the line (e.g., no lip reading or sign viewing). For persons with only partial hearing impairment, some simply turn up the volume on their telephones to try to compensate for their loss and can make do in most cases. For others with more severe hearing loss, conventional telephones cannot compensate for their loss and telephone communication is a poor option.

An industry has evolved for providing communication services to AUs whereby voice communications from a person linked to an AU's communication device are transcribed into text and displayed on an electronic display screen for the AU to read during a communication session. In many cases the AU's device will also broadcast the linked person's voice substantially simultaneously as the text is displayed so that an AU that has some ability to hear can use their hearing sense to discern most phrases and can refer to the text when some part of a communication is not understandable from what was heard.

U.S. Pat. No. 6,603,835 (hereinafter “the '835 patent”) titled “System For Text Assisted Telephony” teaches several different types of relay systems for providing text captioning services to AUs. One captioning service type is referred to as a single line system where a relay is linked between an AU's device and a telephone used by the person communicating with the AU. Hereinafter, unless indicated otherwise, the other person communicating with the AU will be referred to as a hearing user (HU) even though the AU may in fact be communicating with another AU. In single line systems, one line links an HU device to the relay and one line (e.g., the single line) links the relay to the AU device. Voice from the HU is presented to a relay call assistant (CA) who transcribes the voice to text and then the text is transmitted to the AU device to be displayed. The HU's voice is also, in at least some cases, carried or passed through the relay to the AU device to be broadcast to the AU.

The other captioning service type described in the '835 patent is a two line system. In a two line system, an HU's telephone is directly linked to an AU's device via a first line for voice communications between the AU and the HU. When captioning is required, the AU can select a captioning control button on the AU device to link to the relay and provide the HU's voice to the relay on a second line. Again, a relay CA listens to the HU voice message and transcribes the voice message into text, which is transmitted back to the AU device on the second line to be displayed to the AU. One of the primary advantages of the two line system over single line systems is that the AU can add captioning to an ongoing call. This is important as many AUs are only partially impaired and may only want captioning when absolutely necessary. The option to not have captioning is also important in cases where an AU device can be used as a normal telephone and where non-AUs (e.g., a spouse living with an AU that has good hearing capability) that do not need captioning may also use the AU device.

With any relay system, the primary factors for determining the value of the system are accuracy, speed and cost to provide the service. Regarding accuracy, text should accurately represent spoken messages from HUs so that an AU reading the text has an accurate understanding of the meaning of the message. Erroneous words provide inaccurate messages and also can cause confusion for an AU reading transcribed text.

Regarding speed, ideally text is presented to an AU simultaneously with the voice message corresponding to the text so that an AU sees text associated with a message as the message is heard. In this regard, text that trails a voice message by several seconds can cause confusion. Current systems present captioned text relatively quickly (e.g., 1-3 seconds after the voice message is broadcast) most of the time. However, at times a CA can fall behind when captioning so that longer delays (e.g., 10-15 seconds) occur.

Regarding cost, existing systems require a unique and highly trained CA for each communication session. In known cases CAs need to be able to speak clearly and need to be able to type quickly and accurately. CA jobs are also relatively high pressure jobs and therefore turnover is relatively high when compared to jobs in many other industries, which further increases the costs associated with operating a relay.

One innovation that has increased captioning speed appreciably and that has reduced the costs associated with captioning at least somewhat has been the use of voice-to-text transcription software by relay CAs. In this regard, early relay systems required CAs to type all of the text presented via an AU device. To present text as quickly as possible after broadcast of an associated voice message, highly skilled typists were required. During normal conversations people routinely speak at a rate between 110 and 150 words per minute. During a conversation between an AU and an HU, typically only about half the words voiced have to be transcribed (e.g., the AU typically communicates to the HU during half of a session). Because of various inefficiencies this means that to keep up with transcribing the HU's portion of a typical conversation a CA has to be able to type at around 100 words per minute or more. To this end, most professional typists type at around 50 to 80 words per minute and therefore can keep up with a normal conversation for at least some time. Professional typists are relatively expensive. In addition, despite being able to keep up with a conversation most of the time, at other times (e.g., during long conversations or during particularly high speed conversations) even professional typists fall behind transcribing real time text and more substantial delays can occur.

In relay systems that use voice-to-text transcription software trained to a CA's voice, a CA listens to an HU's voice and revoices the HU's voice message to a computer running the trained software. The software, being trained to the CA's voice, transcribes the revoiced message much more quickly than a typist can type text and with only minimal errors. In many respects revoicing techniques for generating text are easier and much faster to learn than high speed typing and therefore training costs and the general costs associated with CAs are reduced appreciably. In addition, because revoicing is much faster than typing in most cases, voice-to-text transcription can be expedited appreciably using revoicing techniques.

At least some prior systems have contemplated further reducing costs associated with relay services by replacing CAs with computers running voice-to-text software to automatically convert HU voice messages to text. In the past there have been several problems with this solution which have resulted in no one implementing a workable system. First, most voice messages (e.g., an HU's voice message) delivered over most telephone lines to a relay are not suitable for direct use with voice-to-text transcription software. In this regard, automated transcription software on the market has been tuned to work well with a voice signal that includes a much larger spectrum of frequencies than the range used in typical phone communications. The frequency range of voice signals on phone lines is typically between 300 and 3000 Hz. Thus, automated transcription software does not work well with voice signals delivered over a telephone line and large numbers of errors occur. Accuracy further suffers where noise exists on a telephone line, which is a common occurrence.

Second, many automated transcription software programs have to be trained to the voice of a speaker to be accurate. When a new HU calls an AU's device, there is no way for a relay to have previously trained software to the HU voice and therefore the software cannot accurately generate text using the HU voice messages.

Third, many automated transcription software packages use context in order to generate text from a voice message. To this end, the words around each word in a voice message can be used by software as context for determining which word has been uttered. To use words around a first word to identify the first word, the words around the first word have to be obtained. For this reason, many automated transcription systems wait to present transcribed text until after subsequent words in a voice message have been transcribed so that context can be used to correct prior words before presentation. Systems that hold off on presenting text in order to correct it using subsequent context cause delay in text presentation, which is inconsistent with the relay system need for real time or close to real time text delivery.

BRIEF SUMMARY OF THE DISCLOSURE

It has been recognized that a hybrid semi-automated system can be provided where, when acceptable accuracy can be achieved using automated transcription software, the system can automatically use the transcription software to transcribe HU voice messages to text and, when accuracy is unacceptable, the system can patch in a human CA to transcribe voice messages to text. Here, it is believed that the number of CAs required at a large relay facility may be reduced appreciably (e.g., 30% or more) where software can accomplish a large portion of transcription to text. In this regard, not only is the automated transcription software getting better over time, in at least some cases the software may train to an HU's voice and the vagaries associated with voice messages received over a phone line (e.g., the limited 300 to 3000 Hz range) during a first portion of a call so that during a later portion of the call accuracy is particularly good. Training may occur while and in parallel with a CA manually (e.g., via typing, revoicing, etc.) transcribing voice to text and, once accuracy is at an acceptable threshold level, the system may automatically delink from the CA and use the text generated by the software to drive the AU display device.
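
The handoff logic described above can be pictured, for illustration only, as a simple control loop that keeps the CA in the captioning path while the automated engine trains and reroutes output once an accuracy estimate crosses a threshold. The following Python sketch is hypothetical; the asr_transcribe, ca_transcribe and send_to_au callables, the word-match accuracy estimate and the 96% threshold are assumptions for this example and do not represent the disclosed implementation.

    # Illustrative sketch of the semi-automated handoff; all helpers are hypothetical.

    ACCURACY_THRESHOLD = 0.96  # accuracy standard before the CA is delinked


    def estimate_accuracy(asr_words, ca_words):
        """Rough agreement score: fraction of CA words matched by the ASR words."""
        if not ca_words:
            return 0.0
        matches = sum(1 for a, c in zip(asr_words, ca_words) if a == c)
        return matches / len(ca_words)


    def caption_call(voice_segments, asr_transcribe, ca_transcribe, send_to_au):
        """Route CA captions to the AU until ASR accuracy exceeds the threshold."""
        use_automated = False
        for segment in voice_segments:
            asr_words = asr_transcribe(segment)   # automated text, also used for training
            if use_automated:
                send_to_au(asr_words)             # CA already cut out of the call
                continue
            ca_words = ca_transcribe(segment)     # CA types or revoices the segment
            send_to_au(ca_words)                  # AU sees CA generated text
            if estimate_accuracy(asr_words, ca_words) >= ACCURACY_THRESHOLD:
                use_automated = True              # delink the CA for later segments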

It has been recognized that in a relay system there are at least two processors that may be capable of performing automated voice recognition processes and therefore that can handle the automated voice recognition part of a triage process involving a CA. To this end, in most cases either a relay processor or an AU's device processor may be able to perform the automated transcription portion of a hybrid process. For instance, in some cases an AU's device will perform automated transcription in parallel with a relay assistant generating CA generated text, where the relay and AU's device cooperate to provide text and assess when the CA should be cut out of a call with the automated text replacing the CA generated text.

In other cases, where an HU's communication device is a computer or includes a processor capable of transcribing voice messages to text, an HU's device may generate automated text in parallel with a CA generating text, and the HU's device and the relay may cooperate to provide text and determine when the CA should be cut out of the call.

Regardless of which device is performing automated captioning, the CA generated text may be used to assess accuracy of the automated text for the purpose of determining when the CA should be cut out of the call. In addition, regardless of which device is performing automated text captioning, the CA generated text may be used to train the automated voice-to-text software or engine on the fly to expedite the process of increasing accuracy until the CA can be cut out of the call.

It has also been recognized that there are times when a hearing impaired person is listening to an HU's voice without an AU's device providing simultaneous text and the AU is confused and would like transcription of recent voice messages of the HU. For instance, an AU may use an AU's device to carry on a non-captioned call and have difficulty understanding a voice message, so the AU initiates a captioning service to obtain text for subsequent voice messages. Here, while text is provided for subsequent messages, the AU still cannot obtain an understanding of the voice message that prompted initiation of captioning. As another instance, where CA generated text lags appreciably behind a current HU's voice message, an AU may request that the captioning catch up to the current message.

To provide captioning of recent voice messages in these cases, in at least some embodiments of this disclosure an AU's device stores an HU's voice messages and, when captioning is initiated or a catch up request is received, the recorded voice messages are used either to automatically generate text or to have a CA generate text corresponding to the recorded voice messages.
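
One minimal way to support this kind of catch-up captioning, offered purely as an illustrative sketch, is to keep a rolling buffer of the most recent HU voice segments on the AU's device and hand the buffered segments to whichever captioning source is engaged when a request arrives. The buffer length and the caption and display callables below are hypothetical assumptions.

    from collections import deque


    class RecentVoiceBuffer:
        """Keeps the most recent HU voice segments so they can be captioned later."""

        def __init__(self, max_segments=30):
            self.segments = deque(maxlen=max_segments)  # oldest segments fall off

        def add(self, audio_segment):
            """Called for every HU voice segment, captioned or not."""
            self.segments.append(audio_segment)

        def caption_backlog(self, caption, display):
            """On a captioning or catch-up request, transcribe the stored audio."""
            for segment in list(self.segments):
                display(caption(segment))   # text for voice heard before the request
            self.segments.clear()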

In at least some cases, when automated software is trained to an HU's voice, a voice model for the HU that can be used subsequently to tune automated software to transcribe the HU's voice may be stored along with a voice profile for the HU that can be used to distinguish the HU's voice from other HUs. Thereafter, when the HU calls an AU's device again, the profile can be used to identify the HU and the voice model can be used to tune the software so that the automated software can immediately start generating highly accurate, or at least relatively more accurate, text corresponding to the HU's voice messages.
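
The profile lookup can be thought of as a nearest-match search: when a call begins, a short sample of the HU's voice is reduced to a feature vector and compared against stored profiles, and the best match selects the stored voice model used to tune the engine. The feature vectors, the cosine similarity measure and the minimum score in the following sketch are hypothetical illustrations rather than the specific profile matching of the disclosure.

    import math


    def cosine_similarity(a, b):
        """Similarity between two equal-length feature vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


    def find_voice_model(sample_features, stored_profiles, min_score=0.85):
        """Return the stored voice model whose profile best matches the caller.

        stored_profiles maps a caller identifier to a (profile_vector, voice_model)
        pair.  Returns None when no profile is similar enough, in which case the
        engine would simply train a fresh model during the call.
        """
        best_model, best_score = None, min_score
        for profile_vector, voice_model in stored_profiles.values():
            score = cosine_similarity(sample_features, profile_vector)
            if score > best_score:
                best_model, best_score = voice_model, score
        return best_model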

Some embodiments include a relay for captioning a hearing user's (HU's) voice signal during a phone call between an HU and a hearing assisted user (AU), the HU using an HU device and the AU using an AU device where the HU voice signal is transmitted from the HU device to the AU device, the relay comprising a display screen and a processor linked to the display and programmed to perform the steps of receiving the HU voice signal from the AU device, transmitting the HU voice signal to a remote automatic speech recognition (ASR) server running ASR software that converts the HU voice signal to ASR generated text, the remote ASR server located at a remote location from the relay, receiving the ASR generated text from the ASR server, presenting the ASR generated text for viewing by a call assistant (CA) via the display and transmitting the ASR generated text to the AU device.

In at least some embodiments the relay further includes an interface that enables a CA to make changes to the ASR generated text presented on the display. In some cases the processor is further programmed to transmit CA corrections made to the ASR generated text to the AU device with instructions to modify the ASR generated text previously sent to the AU device. In some cases the relay separates the HU voice signal into voice signal slices, the step of transmitting the HU voice signal to the ASR server includes independently transmitting the voice signal slices to the remote ASR server for captioning and wherein the step of receiving the ASR generated text from the ASR server includes receiving separate ASR generated text segments for each of the slices and cobbling the separate segments together to form a stream of ASR generated text.

In some cases at least some of the voice signal slices overlap. In some cases at least some of the voice signal slices are relatively short and some of the voice signal slices are relatively long, wherein the short voice signal slices are consecutive and do not overlap and wherein at least some relatively long voice signal slices overlap at least first and second ones of the relatively short voice signal slices. In some cases at least some of the ASR generated text associated with overlapping voice signal slices is inconsistent, the relay applying a rule set to identify which inconsistent ASR generated text to use in the stream of ASR generated text.
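
A toy sketch helps illustrate the slicing arrangement described above: short consecutive slices support low-latency captioning, longer overlapping slices give the ASR engine more context, and a rule decides which result to keep when the two disagree. The slice lengths and the prefer-the-longer-slice rule below are assumptions for illustration only and are not the specific rule set of the disclosure.

    def slice_voice_signal(samples, short_len, long_len):
        """Split a voice signal into short consecutive slices plus longer
        overlapping slices that each span several short slices."""
        short_slices = [samples[i:i + short_len]
                        for i in range(0, len(samples), short_len)]
        long_slices = [samples[i:i + long_len]
                       for i in range(0, len(samples), short_len)
                       if i + long_len <= len(samples)]
        return short_slices, long_slices


    def reconcile(short_text, long_text):
        """Example rule set: when text from a short slice disagrees with the
        overlapping long slice, prefer the long slice, which had more context."""
        return long_text if long_text and long_text != short_text else short_text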

In some cases the ASR server generates ASR error corrections for the ASR generated text, the relay further programmed to perform the steps of receiving the ASR error corrections, using the error corrections to automatically correct at least some of the errors in the ASR generated text on the display screen and transmitting the ASR error corrections to the AU device. In at least some embodiments the relay further includes an interface that enables a CA to make changes to the ASR generated text presented on the display, the processor further programmed to transmit CA corrections made to the ASR generated text to the AU device with instructions to modify the ASR generated text previously sent to the AU device. In some cases, after a CA makes a change to the ASR generated text, the text prior thereto becomes firm so that no ASR error corrections are made to that text thereafter.
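
The firming behavior can be captured with a simple index: once a CA edits a word, all text through that word is treated as firm, and any later ASR error correction aimed at a firm position is ignored. The caption stream representation below is a hypothetical sketch.

    class CaptionStream:
        """Tracks displayed caption words and which of them are firm."""

        def __init__(self):
            self.words = []
            self.firm_through = -1   # index of the last firm word

        def append(self, word):
            self.words.append(word)

        def ca_correction(self, index, new_word):
            """A CA edit replaces the word and firms up all text through it."""
            self.words[index] = new_word
            self.firm_through = max(self.firm_through, index)

        def asr_correction(self, index, new_word):
            """ASR error corrections only apply to text that is not yet firm."""
            if index > self.firm_through:
                self.words[index] = new_word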

In some cases the relay further includes a speaker and the processor broadcasts the HU voice signal to the CA via the speaker as the ASR generated text is presented on the display screen. In some cases the processor aligns broadcast of the HU voice signal with ASR generated text presented on the display screen. In some cases the processor presents the ASR generated text on the display screen immediately upon reception, transmits the ASR generated text immediately upon reception and broadcasts the HU voice signal under control of the CA using an interface. In some cases, as a word in the HU voice signal is broadcast to the CA, text corresponding to the broadcast word on the display screen is visually distinguished from other text on the display screen.

Other embodiments include a relay for captioning a hearing user's (HU's) voice signal during a phone call between an HU and a hearing assisted user (AU), the HU using an HU device and the AU using an AU device where the HU voice signal is transmitted from the HU device to the AU device, the relay comprising a display screen, an interface device and a processor linked to the display screen and the interface device, the processor programmed to perform the steps of receiving the HU voice signal from the AU device, separating the HU voice signal into voice signal slices, separately transmitting the HU voice signal slices to a remote automatic speech recognition (ASR) server that is located at a remote location from the relay, receiving separate ASR generated text segments for each of the slices and cobbling the separate segments together to form a stream of ASR generated text, presenting the stream of ASR generated text as it is received from the ASR server for viewing by a call assistant (CA) via the display and transmitting the stream of ASR generated text to the AU device as the stream is received from the ASR server.

In some cases ASR error corrections to the ASR generated text are received from the ASR server and at least some of the ASR error corrections are used to correct the text on the display; the relay also receives CA error corrections to the text on the display and uses those corrections to correct text on the display. In some cases, once a CA corrects an error in the text on the display, ASR error corrections for text prior to the CA corrected text on the display are not used to make error corrections on the display. In some cases all ASR generated text presented on the display is transmitted to the AU device and all ASR error corrections and CA text corrections that are presented on the display are transmitted as correction text to the AU device.

Some embodiments include a caption device for use by a hard of hearing assisted user (AU) to assist the AU during voice communications with a hearing user (HU) using an HU device, the caption device comprising a display screen, a memory, at least one communication link element for linking to a communication network, a speaker and a processor linked to each of the display screen, the memory, the speaker and the communication link, the processor programmed to perform the steps of receiving an HU voice signal from the HU device during a call, broadcasting the HU voice signal to the AU via the speaker, storing at least a most recent portion of the HU voice signal in the memory, receiving a command from the AU to start a captioning session and, upon receiving the command, obtaining a text caption corresponding to the stored HU voice signal and presenting the text caption to the AU via the display.

In some cases the step of obtaining a text caption includes initiating a process whereby an automated speech recognition (ASR) program converts the stored HU voice signal to text. In some cases the processor runs the ASR program. In some cases the step of initiating the process includes establishing a link to a remote relay and transmitting the stored HU voice signal to the relay, the step of obtaining further including receiving the text caption from the relay. In at least some embodiments the method further includes, subsequent to receiving the command, obtaining text captions for additional HU voice signals received during the ongoing call. In some cases the step of obtaining a text caption of the stored HU voice signal includes initiating a process whereby the HU voice signal is converted to text via an automatic speech recognition (ASR) engine and the step of obtaining text captions for additional HU voice signals received during the ongoing call further includes transmitting the additional HU voice signals to a relay and receiving text captions back from the relay.

To the accomplishment of the foregoing and related ends, the disclosure, then, comprises the features hereinafter fully described. The following description and the annexed drawings set forth in detail certain illustrative aspects of the disclosure. However, these aspects are indicative of but a few of the various ways in which the principles of the invention can be employed. Other aspects, advantages and novel features of the disclosure will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a schematic showing various components of a communication system including a relay that may be used to perform various processes and methods according to at least some aspects of the present invention;

FIG. 2 is a schematic of the relay server shown in FIG. 1;

FIG. 3 is a flow chart showing a process whereby an automated voice-to-text engine is used to generate automated text in parallel with a CA generating text where the automated text is used instead of CA generated text to provide captioning to an AU's device once an accuracy threshold has been exceeded;

FIG. 4 is a sub-process that may be substituted for a portion of the process shown in FIG. 3 whereby a call assistant can determine whether or not the automated text takes over the process after the accuracy threshold has been achieved;

FIG. 5 is a sub-process that may be added to the process shown in FIG. 3 wherein, upon an AU's requesting help, a call is linked to a second CA for correcting the automated text;

FIG. 6 is a process whereby an automated voice-to-text engine is used to fill in text for an HU's voice messages that are skipped over by a CA when an AU requests instantaneous captioning of a current message;

FIG. 7 is a process whereby automated text is automatically used to fill in captioning when transcription by a CA lags behind an HU's voice messages by a threshold duration;

FIG. 8 is a flow chart illustrating a process whereby text is generated for an HU's voice messages that precede a request for captioning services;

FIG. 9 is a flow chart illustrating a process whereby voice messages prior to a request for captioning service are automatically transcribed to text by an automated voice-to-text engine;

FIG. 10 is a flow chart illustrating a process whereby an AU's device processor performs transcription processes until a request for captioning is received, at which point the AU's device presents texts related to HU voice messages prior to the request and ongoing voice messages are transcribed via a relay;

FIG. 11 is a flow chart illustrating a process whereby an AU's device processor generates automated text for a hearing user's voice messages which is presented via a display to an AU and also transmits the text to a CA at a relay for correction purposes;

FIG. 12 is a flow chart illustrating a process whereby high definition digital voice messages and analog voice messages are handled differently at a relay;

FIG. 13 is a process similar to FIG. 12, albeit where an AU also has the option to link to a CA for captioning service regardless of the type of voice message received;

FIG. 14 is a flow chart that may be substituted for a portion of the process shown in FIG. 3 whereby voice models and voice profiles are generated for frequent HUs that communicate with an AU where the models and profiles can be subsequently used to increase accuracy of a transcription process;

FIG. 15 is a flow chart illustrating a process similar to the sub-process shown in FIG. 14 where voice profiles and voice models are generated and stored for subsequent use during transcription;

FIG. 16 is a flow chart illustrating a sub-process that may be added to the process shown in FIG. 15 where the resulting process calls for training of a voice model at each of an AU's device and a relay;

FIG. 17 is a schematic illustrating a screen shot that may be presented via an AU's device display screen;

FIG. 18 is similar to FIG. 17, albeit showing a different screen shot;

FIG. 19 is a process that may be performed by the system shown in FIG. 1 where automated text is generated for line check words and is presented to an AU immediately upon identification of the words;

FIG. 20 is similar to FIG. 17, albeit showing a different screen shot;

FIG. 21 is a flow chart illustrating a method whereby an automated voice-to-text engine is used to identify errors in CA generated text which can be highlighted and can be corrected by a CA;

FIG. 22 is an exemplary AU device display screen shot that illustrates visually distinct text to indicate non-textual characteristics of an HU voice signal to an AU;

FIG. 23 is an exemplary CA workstation display screen shot that shows how automated ASR text associated with an instantaneously broadcast word may be visually distinguished for an error correcting CA;

FIG. 23A is a screen shot of a CA interface providing an option to switch from ASR generated text to a full CA system where a CA generates caption text;

FIG. 24 shows an exemplary HU communication device with CA captioned HU text and ASR generated AU text presented as well as other communication information that is consistent with at least some aspects of the present disclosure;

FIG. 25 is an exemplary CA workstation display screen shot similar to FIG. 23, albeit where a CA has corrected an error and an HU voice signal playback has been skipped backward as a function of where the correction occurred;

FIG. 26 is a screen shot of an exemplary AU device display that presents CA captioned HU text as well as ASR engine generated AU text;

FIG. 27 is an illustration of an exemplary HU device that shows text corresponding to the HU's voice signal as well as an indication of which word in the text has been most recently presented to an AU;

FIG. 28 is a schematic diagram showing a relay captioning system that is consistent with at least some aspects of the present disclosure;

FIG. 28A includes a flowchart of a process that is consistent with at least some aspects of the present disclosure;

FIG. 28B includes a flowchart of a process that is consistent with at least some aspects of the present disclosure;

FIG. 29 is a schematic diagram of a relay system that includes a text transcription quality assessment function that is consistent with at least some aspects of the present disclosure;

FIG. 30 is similar to FIG. 29, albeit showing a different relay system that includes a different quality assessment function;

FIG. 31 is similar to FIG. 29, albeit showing a third relay system that includes a third quality assessment function;

FIG. 32 is a flow chart illustrating a method whereby time stamps are assigned to HU voice segments which are then used to substantially synchronize text and voice presentation;

FIG. 33 is a schematic illustrating a caption relay system that may implement the method illustrated in FIG. 32 as well as other methods described herein;

FIG. 34 is a sub-process that may be substituted for a portion of the FIG. 32 process where an AU device assigns a sequence of time stamps to a sequence of text segments;

FIG. 35 is another flow chart illustrating another method for assigning and using time stamps to synchronize text and HU voice broadcast;

FIG. 36 is a screen shot illustrating a CA interface where a prior word is selected to be rebroadcast;

FIG. 37 is a screen shot similar to FIG. 36, albeit of an AU device display showing an AU selecting a prior broadcast phrase for rebroadcast;

FIG. 38 is another sub-process that may be substituted for a portion of the FIG. 32 method;

FIG. 39 is a screen shot showing a CA interface where various inventive features are shown;

FIG. 40 is a screen shot illustrating another CA interface where low and high confidence text is presented in different columns to help a CA more easily distinguish between text likely to need correction and text that is less likely to need correction;

FIG. 40A is a screen shot of a CA interface showing low confidence caption text visually distinguished from other text presented to a CA for correction consideration, among other things;

FIG. 40B shows screen shots presented via a CA workstation interface that are consistent with at least some aspects of the present disclosure;

FIG. 40C is similar to FIG. 40B, albeit showing a different screen shot presented to a CA during caption error correction;

FIG. 41 is a flow chart illustrating a method of introducing errors in ASR generated text to test CA attention;

FIG. 42 is a screen shot illustrating an AU interface including, in addition to text presentation, an HU video field and a CA signing field that is consistent with at least some aspects of the present disclosure;

FIG. 43 is a screen shot illustrating yet another CA interface;

FIG. 44 is another AU interface screen shot including scrolling text and an HU video window;

FIG. 45 is another CA interface screen shot showing a CA correction field, an ASR uncorrected text field and an intervening time field that is consistent with at least some aspects of the present disclosure;

FIG. 46 is a schematic illustrating different phrase slices that may be formed that is consistent with at least some aspects of the present disclosure;

FIG. 47 is a screen shot illustrating an interface presented to a CA that includes various transcription feedback tools that are consistent with various aspects of the present disclosure;

FIG. 48 is a screen shot illustrating an interface presented to an AU that indicates a transition from automated text to CA generated text that is consistent with at least some aspects of the present disclosure;

FIG. 49 is similar to FIG. 48, albeit illustrating an interface that indicates a transition from automated text to CA corrected text that is consistent with at least some aspects of the present disclosure;

FIG. 50 is a screen shot showing a CA interface that, among other things, enables a CA to select specific points in ASR generated text to firm up prior ASR generated text;

FIG. 51 is a screen shot illustrating an administrator's interface that shows results of CA generated text and scoring tools used to assess quality of captions generated by a CA;

FIG. 52 is a screen shot illustrating a CA interface where a CA is restricted to editing text within a small field of recent text to ensure that the CA keeps up with current HU voice utterances within some window of time;

FIG. 53 is similar to FIG. 52, albeit showing the interface at a different point in time;

FIG. 54 is a top plan view of a CA workstation including an eye tracking camera that is consistent with at least some aspects of some embodiments of the present disclosure;

FIG. 55 is a schematic illustrating an exemplary CA screen shot and a camera that tracks a CA's eyes that is consistent with at least some aspects of some embodiments of the present disclosure;

FIG. 56 is a screen shot showing an AU interface where a first error correction is shown distinguished in multiple ways;

FIG. 57 is a screen shot similar to FIG. 56, albeit where the first error correction is shown in a less noticeable way and a second error correction is shown distinguished in multiple ways so that the distinguishing effect related to the first error correction appears to be extinguishing;

FIG. 58 is similar to FIGS. 56 and 57, albeit showing the interface after a third error correction is presented where the first error correction is now shown as normal text, the second is shown distinguished in an extinguishing fashion and the third error correction is fully distinguished;

FIG. 59 includes a flowchart that shows a process that is consistent with at least some aspects of the present disclosure;

FIG. 60 includes a flowchart that shows a process that is consistent with at least some aspects of the present disclosure;

FIG. 61 is a schematic diagram that illustrates many combinations of system components that may cooperate in many different ways to provide captioning services to an assisted user;

FIG. 62 shows a captioned device display screen including an eye tracking camera that is consistent with at least some aspects of the present disclosure;

FIG. 63 includes a flowchart that shows a process that is consistent with at least some aspects of the present disclosure;

FIG. 63A includes a flowchart of a process that is consistent with at least some aspects of the present disclosure;

FIG. 64 is a captioning screen shot that is consistent with at least some aspects of the present disclosure;

FIG. 65 is a captioning screen shot that is consistent with at least some aspects of the present disclosure;

FIG. 66 is a captioning screen shot showing one view of a teleconference communication captioning system that is consistent with at least some aspects of the present disclosure;

FIG. 67 is similar to FIG. 66, albeit showing another captioning teleconference view;

FIG. 68 is similar to FIG. 66, albeit showing another captioning teleconference view; and

FIG. 69 includes a flowchart that shows a process that is consistent with at least some aspects of the present disclosure.

While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the appended claims.

DETAILED DESCRIPTION OF THE DISCLOSURE

The various aspects of the subject disclosure are now described with reference to the annexed drawings, wherein like reference numerals correspond to similar elements throughout the several views. It should be understood, however, that the drawings and detailed description hereafter relating thereto are not intended to limit the claimed subject matter to the particular form disclosed. Rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the claimed subject matter.

As used herein, the terms “component,” “system” and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computer and the computer can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers or processors.

The word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs.

Furthermore, the disclosed subject matter may be implemented as a system, method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer or processor based device to implement aspects detailed herein. The term “article of manufacture” (or alternatively, “computer program product”) as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. For example, computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . . ), smart cards, solid state drives and flash memory devices (e.g., card, stick). Additionally it should be appreciated that a carrier wave can be employed to carry computer-readable electronic data such as those used in transmitting and receiving electronic mail or in accessing a network such as the Internet or a local area network (LAN). Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.

Unless indicated otherwise, the phrases “assisted user”, “hearing user” and “call assistant” will be represented by the acronyms “AU”, “HU” and “CA”, respectively. The acronym “ASR” will be used to abbreviate the phrase “automatic speech recognition”. Unless indicated otherwise, the phrase “full CA mode” will be used to refer to a call captioning system instantaneously generating captions for at least a portion of a communication session wherein a voice signal is listened to by a live CA (e.g., a person) who transcribes the voice message to text and then corrects that text, where the CA generated text is presented to at least one of the communicants to the communication session. The phrase “ASR-CA backed up mode” will be used to refer to a call captioning system instantaneously generating captions for at least a portion of a communication session where a voice signal is fed to an ASR software engine (e.g., a computer running software) that generates at least initial captions for the received voice signal and where a CA corrects the original captions, where the ASR generated captions and, in at least some cases, the CA generated corrections are presented to at least one of the communicants to the communication session.

System Architecture

Referring now to the drawings wherein like reference numerals correspond to similar elements throughout the several views and, more specifically, referring to FIG. 1, the present disclosure will be described in the context of an exemplary communication system 10 including an AU's communication device 12, an HU's telephone or other type of communication device 14, and a relay 16. The AU's device 12 is linked to the HU's device 14 via any network connection capable of facilitating a voice call between the AU and the HU. For instance, the link may be a conventional telephone line, a network connection such as an internet connection or other network connection, a wireless connection, etc. AU device 12 includes a keyboard 20, a display screen 18 and a handset 22. Keyboard 20 can be used to dial any telephone number to initiate a call and, in at least some cases, includes other keys or may be controlled to present virtual buttons via screen 18 for controlling various functions that will be described in greater detail below. Other identifiers such as IP addresses or the like may also be used in at least some cases to initiate a call. Screen 18 includes a flat panel display screen for displaying, among other things, text transcribed from a voice message or signal generated using HU's device 14, control icons or buttons, caption feedback signals, etc. Handset 22 includes a speaker for broadcasting an HU's voice messages to an AU and a microphone for receiving a voice message from an AU for delivery to the HU's device 14. AU device 12 may also include a second loud speaker so that device 12 can operate as a speaker phone type device. Although not shown, device 12 further includes a processor and a memory for storing software run by the processor to perform various functions that are consistent with at least some aspects of the present disclosure. Device 12 is also linked or is linkable to relay 16 via any communication network including a phone network, a wireless network, the internet or some other similar network, etc. Device 12 may further include a Bluetooth or other type of transmitter for linking to an AU's hearing aid or some other speaker type device.

HU's device 14, in at least some embodiments, includes a communication device (e.g., a telephone) including a keyboard for dialing phone numbers and a handset including a speaker and a microphone for communication with other devices. In other embodiments device 14 may include a computer, a smart phone, a smart tablet, etc., that can facilitate audio communications with other devices. Devices 12 and 14 may use any of several different communication protocols including analog or digital protocols, a VOIP protocol or others.

Referring still to FIG. 1, relay 16 includes, among other things, a relay server 30 and a plurality of CA work stations 32, 34, etc. Each of the CA work stations 32, 34, etc., is similar and operates in a similar fashion and therefore only station 32 is described here in any detail. Station 32 includes a display screen 50, a keyboard 52 and a headphone/microphone headset 54. Screen 50 may be any type of electronic display screen for presenting information including text transcribed from an HU's voice signal or message. In most cases screen 50 will present a graphical user interface with on screen tools for editing text that appears on the screen. One text editing system is described in U.S. Pat. No. 7,164,753, which issued on Jan. 16, 2007, which is titled “Real Time Transcription Correction System” and which is incorporated herein in its entirety.

Keyboard 52 is a standard text entry QWERTY type keyboard and can be used to type text or to correct text presented on display screen 50. Headset 54 includes a speaker in an ear piece and a microphone in a mouth piece and is worn by a CA. The headset enables a CA to listen to the voice of an HU and the microphone enables the CA to speak voice messages into the relay system such as, for instance, revoiced messages from an HU to be transcribed into text. For instance, typically during a call between an HU on device 14 and an AU on device 12, the HU's voice messages are presented to a CA via headset 54 and the CA revoices the messages into the relay system using headset 54. Software trained to the voice of the CA transcribes the assistant's voice messages into text which is presented on display screen 50. The CA then uses keyboard 52 and/or headset 54 to make corrections to the text on display 50. The corrected text is then transmitted to the AU's device 12 for display on screen 18. In the alternative, the text may be transmitted prior to correction to the AU's device 12 for display and corrections may be subsequently transmitted to correct the displayed text via in-line corrections where errors are replaced by corrected text.

Although not shown, CA work station 32 may also include a foot pedal or other device for controlling the speed with which voice messages are played via headset 54 so that the CA can slow or even stop play of the messages while the assistant either catches up on transcription or correction of text.

Referring still to FIG. 1 and also to FIG. 2, server 30 is a computer system that includes, among other components, at least a first processor 56 linked to a memory or database 58 where software run by processor 56 to facilitate various functions that are consistent with at least some aspects of the present disclosure is stored. The software stored in memory 58 includes pre-trained CA voice-to-text transcription software 60 for each CA where CA specific software is trained to the voice of an associated CA thereby increasing the accuracy of transcription activities. For instance, Naturally Speaking continuous speech recognition software by Dragon, Inc. may be pre-trained to the voice of a specific CA and then used to transcribe voice messages voiced by the CA into text.

In addition to the CA trained software, a voice-to-text software program 62 that is not pre-trained to a CA's voice and instead trains to any voice on the fly as voice messages are received is stored in memory 58. Again, Naturally Speaking software that can train on the fly may be used for this purpose. Hereinafter, the automatic speech recognition software or system that trains to the HU voices will be referred to generally as an ASR engine at times.

Moreover, software 64 that automatically performs one of several different types of triage processes to generate text from voice messages accurately, quickly and in a relatively cost effective manner is stored in memory 58. The triage programs are described in detail hereafter.

One issue with existing relay systems is that each call is relatively expensive to facilitate. To this end, in order to meet required accuracy standards for text caption calls, each call requires a dedicated CA. While automated voice-to-text systems that would not require a CA have been contemplated, none has been successfully implemented because of accuracy and speed problems.

Basic Semi-Automated System

One aspect of the present disclosure is related to a system that is semi-automated wherein a CA is used when accuracy of an automated system is not at required levels and the assistant is cut out of a call automatically or manually when accuracy of the automated system meets or exceeds accuracy standards or at the preference of an AU. For instance, in at least some cases a CA will be assigned to every new call linked to a relay and the CA will transcribe voice to text as in an existing system. Here, however, the difference will be that, during the call, the voice of an HU will also be processed by server 30 to automatically transcribe the HU's voice messages to text (e.g., into “automated text”). Server 30 compares corrected text generated by the CA to the automated text to identify errors in the automated text. Server 30 uses identified errors to train the automated voice-to-text software to the voice of the HU. During the beginning of the call the software trains to the HU's voice and accuracy increases over time as the software trains. At some point the accuracy increases until required accuracy standards are met. Once accuracy standards are met, server 30 is programmed to automatically cut out the CA and start transmitting the automated text to the AU's device 12.

In at least some cases, when a CA is cut out of a call, the system may provide a “Help” button, an “Assist” button or “Assistance Request” type button (see 68 in FIG. 1) to an AU so that, if the AU recognizes that the automated text has too many errors for some reason, the AU can request a link to a CA to increase transcription accuracy (e.g., generate an assistance request). In some cases the help button may be a persistent mechanical button on the AU's device 12. In the alternative, the help button may be a virtual on screen icon (e.g., see 68 in FIG. 1) and screen 18 may be a touch sensitive screen so that contact with the virtual button can be sensed. Where the help button is virtual, the button may only be presented after the system switches from providing CA generated text to an AU's device to providing automated text to the AU's device to avoid confusion (e.g., avoid a case where an AU is already receiving CA generated text but thinks, because of a help button, that even better accuracy can be achieved in some fashion). Thus, while CA generated text is displayed on an AU's device 12, no “help” button is presented and after automated text is presented, the “help” button is presented. After the help button is selected and a CA is re-linked to the call, the help button is again removed from the AU's device display 18 to avoid confusion.

Referring now to FIGS. 2 and 3, a method or process 70 is illustrated that may be performed by server 30 to cut out a CA when automated text reaches an accuracy level that meets a standard threshold level. Referring also to FIG. 1, at block 72, help and auto flags are each set to a zero value. The help flag indicates that an AU has selected a help or assist button via the AU's device 12 because of a perception that too many errors are occurring in transcribed text. The auto flag indicates that automated text accuracy has exceeded a standard threshold requirement. Zero values indicate that the help button has not been selected and that the standard requirement has yet to be met, and values of one indicate that the button has been selected and that the standard requirement has been met.

Referring still to FIGS. 1 and 3, at block 74, during a phone call between an HU using device 14 and an AU using device 12, the HU's voice messages are transmitted to server 30 at relay 16. Upon receiving the HU's voice messages, server 30 checks the auto and help flags at blocks 76 and 84, respectively. At least initially the auto flag will be set to zero at block 76, meaning that automated text has not reached the accuracy standard requirement, and therefore control passes down to block 78 where the HU's voice messages are provided to a CA. At block 80, the CA listens to the HU's voice messages and generates text corresponding thereto by either typing the messages, revoicing the messages to voice-to-text transcription software trained to the CA's voice, or a combination of both. Text generated is presented on screen 50 and the CA makes corrections to the text using keyboard 52 and/or headset 54 at block 80. At block 82 the CA generated text is transmitted to AU device 12 to be displayed for the AU on screen 18.

Referring again to FIGS. 1 and 3, at block 84, at least initially the help flag will be set to zero indicating that the AU has not requested additional captioning assistance. In fact, at least initially the “help” button 68 may not be presented to an AU as CA generated text is initially presented. Where the help flag is zero at block 84, control passes to block 86 where the HU's voice messages are fed to voice-to-text software run by server 30 that has not been previously trained to any particular voice. At block 88 the software automatically converts the HU's voice to text, generating automated text. At block 90, server 30 compares the CA generated text to the automated text to identify errors in the automated text. At block 92, server 30 uses the errors to train the voice-to-text software for the HU's voice. In this regard, for instance, where an error is identified, server 30 modifies the software so that the next time the utterance that resulted in the error occurs, the software will generate the word or words that the CA generated for the utterance. Other ways of altering or training the voice-to-text software are well known in the art and any way of training the software may be used at block 92.
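
The comparison at block 90 can be illustrated, under the assumption that both texts are compared word by word, with a standard sequence alignment; the differing spans then serve as error pairs for training at block 92. The use of Python's difflib and the example sentences below are illustrative assumptions, not the specific comparison used by server 30.

    import difflib


    def find_asr_errors(ca_words, asr_words):
        """Align CA generated text with automated text and return pairs of
        (asr_segment, ca_segment) where they differ; the pairs can then be
        used to adjust the engine so the CA's words are produced next time."""
        matcher = difflib.SequenceMatcher(a=asr_words, b=ca_words)
        errors = []
        for op, a1, a2, b1, b2 in matcher.get_opcodes():
            if op != "equal":
                errors.append((asr_words[a1:a2], ca_words[b1:b2]))
        return errors


    # Example: the engine produced "wreck a nice beach" where the CA produced
    # "recognize speech"; the differing spans become one training pair.
    print(find_asr_errors("recognize speech today".split(),
                          "wreck a nice beach today".split()))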

After block 92 control passes to block 94 where server 30 monitors for a selection of the “help” button 68 by the AU. If the help button has not been selected, control passes to block 96 where server 30 compares the accuracy of the automated text to a threshold standard accuracy requirement. For instance, the standard requirement may require that accuracy be greater than 96% measured over at least a most recent forty-five second period or a most recent 100 words uttered by an HU, whichever is longer. Where accuracy is below the threshold requirement, control passes back up to block 74 where the process described above continues. At block 96, once the accuracy is greater than the threshold requirement, control passes to block 98 where the auto flag is set to one indicating that the system should start using the automated text and delink the CA from the call to free up the assistant to handle a different call. A virtual “help” button may also be presented via the AU's display 18 at this time. Next, at block 100, the CA is delinked from the call and at block 102 the processor generated automated text is transmitted to the AU device to be presented on display screen 18.
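
The accuracy test at block 96 can be expressed, for illustration, as a rolling window over recently scored words: each word carries a timestamp and a correct/incorrect mark, the window covers at least the most recent forty-five seconds or the most recent 100 words, whichever is longer, and the measured accuracy is compared against the 96% threshold. The data layout below is an assumed sketch, not the specific measurement performed by server 30.

    import time

    THRESHOLD = 0.96      # required accuracy
    MIN_SECONDS = 45.0    # minimum time window
    MIN_WORDS = 100       # minimum word window


    def accuracy_meets_standard(scored_words, now=None):
        """scored_words is a list of (timestamp, was_correct) tuples, oldest first.
        Accuracy is measured over the most recent forty-five seconds or the most
        recent 100 words, whichever window is longer."""
        now = time.time() if now is None else now
        recent = [w for w in scored_words if now - w[0] <= MIN_SECONDS]
        if len(recent) < MIN_WORDS:              # widen to the last 100 words
            recent = scored_words[-MIN_WORDS:]
        if not recent:
            return False
        accuracy = sum(1 for _, ok in recent if ok) / len(recent)
        return accuracy > THRESHOLD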

Referring again to block 74, the HU's voice is continually received during a call and at block 76, once the auto flag has been set to one, the lower portion of the left hand loop including blocks 78, 80 and 82 is cut out of the process as control loops back up to block 74.

Referring again to block 94, if, during an automated portion of a call when automated text is being presented to the AU, the AU decides that there are too many errors in the transcription presented via display 18 and the AU selects the “help” button 68 (see again FIG. 1), control passes to block 104 where the help flag is set to one indicating that the AU has requested the assistance of a CA and the auto flag is reset to zero indicating that CA generated text will be used to drive the AU's display 18 instead of the automated text. Thereafter control passes back up to block 74. Again, with the auto flag set to zero the next time through decision block 76, control passes back down to block 78 where the call is again linked to a CA for transcription as described above. In addition, the next time through block 84, because the help flag is set to one, control passes back up to block 74 and the automated text loop including blocks 86 through 104 is effectively cut out of the rest of the call.

In at least some embodiments, there will be a short delay (e.g., 5 to 10 seconds in most cases) between setting the flags at block 104 and stopping use of the automated text so that a new CA can be linked up to the call and start generating CA generated text prior to halting the automated text. In these cases, until the CA is linked and generating text for at least a few seconds (e.g., 3 seconds), the automated text will still be used to drive the AU's display 18. The delay may either be a pre-defined delay or may have a case specific duration that is determined by server 30 monitoring CA generated text and switching over to the CA generated text once the CA is up to speed.

In some embodiments, prior to delinking a CA from a call at block 100, server 30 may store a CA identifier along with a call identifier for the call. Thereafter, if an AU requests help at block 94, server 30 may be programmed to identify if the CA previously associated with the call is available (e.g., not handling another call) and, if so, may re-link to the CA at block 78. In this manner, if possible, a CA that has at least some context for the call can be linked up to restart transcription services.

In some embodiments it is contemplated that after an AU has selected a help button to receive call assistance, the call will be completed with a CA on the line. In other cases it is contemplated that server 30 may, when a CA is re-linked to a call, start a second triage process to attempt to delink the CA a second time if a threshold accuracy level is again achieved. For instance, in some cases, midstream during a call, a second HU may start communicating with the AU via the HU's device. For instance, a child may yield the HU's device 14 to a grandchild that has a different voice profile, causing the AU to request help from a CA because of perceived text errors. Here, after the hand back to the CA, server 30 may start training on the grandchild's voice and may eventually achieve the threshold level required. Once the threshold is again met, the CA may be delinked a second time so that automated text is again fed to the AU's device.

As another example, text errors in automated text may be caused by temporary noise in one or more of the lines carrying the HU's voice messages to relay 16. Here, once the noise clears up, automated text may again be a suitable option. Thus, here, after an AU requests CA help, the triage process may again commence and, if the threshold accuracy level is again exceeded, the CA may be delinked and the automated text may again be used to drive the AU's device 12. While the threshold accuracy level may be the same each time through the triage process, in at least some embodiments the accuracy level may be changed each time through the process. For instance, the first time through the triage process the accuracy threshold may be 96%. The second time through the triage process the accuracy threshold may be raised to 98%.

In at least some embodiments, when the automated text accuracy exceeds the standard accuracy threshold, there may be a short transition time during which a CA on a call observes automated text while listening to a HU's voice message to manually confirm that the handover from CA generated text to automated text is smooth. During this short transition time, for instance, the CA may watch the automated text on her workstation screen 50 and may correct any errors that occur during the transition. In at least some cases, if the CA perceives that the handoff does not work or the quality of the automated text is poor for some reason, the CA may opt to retake control of the transcription process.

One sub-process 120 that may be added to the process shown in FIG. 3 for managing a CA to automated text handoff is illustrated in FIG. 4. Referring also to FIGS. 1 and 2, at block 96 in FIG. 3, if the accuracy of the automated text exceeds the accuracy standard threshold level, control may pass to block 122 in FIG. 4. At block 122, a short duration transition timer (e.g., 10-15 seconds) is started. At block 124 automated text (e.g., text generated by feeding the HU's voice messages directly to voice-to-text software) is presented on the CA's display 50. At block 126 an on-screen “Retain Control” icon or virtual button is provided to the CA via the assistant's display screen 50 which can be selected by the CA to forego the handoff to the automated voice-to-text software. At block 128, if the “Retain Control” icon is selected, control passes to block 132 where the help flag is set to one and then control passes back up to block 76 in FIG. 3 where the CA process for generating text continues as described above. At block 128, if the CA does not select the “Retain Control” icon, control passes to block 130 where the transition timer is checked. If the transition timer has not timed out, control passes back up to block 124. Once the timer times out at block 130, control passes back to block 98 in FIG. 3 where the auto flag is set to one and the CA is delinked from the call.
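Purely as an illustrative sketch of the FIG. 4 sub-process (not the relay implementation), the transition loop can be thought of as follows, where retain_control_selected and show_automated_text_to_ca are hypothetical callbacks standing in for the CA workstation interface:

    import time

    TRANSITION_SECONDS = 12  # assumed value within the 10-15 second transition timer

    def run_handoff_transition(retain_control_selected, show_automated_text_to_ca):
        """Returns 'automated' when the timer expires without CA intervention
        (blocks 128/130), or 'ca_retained' if the CA elects to keep control
        (block 132)."""
        deadline = time.monotonic() + TRANSITION_SECONDS
        while time.monotonic() < deadline:          # block 130: timer check
            show_automated_text_to_ca()             # block 124: CA observes ASR text
            if retain_control_selected():           # block 128: "Retain Control" icon
                return 'ca_retained'                # help flag set to one (block 132)
            time.sleep(0.25)
        return 'automated'                          # auto flag set, CA delinked (block 98)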

In at least some embodiments it is contemplated that after voice-to-text software takes over the transcription task and the CA is delinked from a call, server 30 itself may be programmed to sense when transcription accuracy has degraded substantially and the server 30 may cause a re-link to a CA to increase accuracy of the text transcription. For instance, server 30 may assign a confidence factor to each word in the automated text based on how confident the server is that the word has been accurately transcribed. The confidence factors over a most recent number of words (e.g., 100) or a most recent period (e.g., 45 seconds) may be averaged and the average used to assess an overall confidence factor for transcription accuracy. Where the confidence factor is below a threshold level, server 30 may re-link to a CA to increase transcription accuracy. The automated process for re-linking to a CA may be used instead of or in addition to the process described above whereby an AU selects the “help” button to re-link to a CA.
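As a simplified, hypothetical illustration of the confidence-factor monitoring just described (the 100 word window is from the example above, while the 0.90 re-link threshold is an assumption):

    from collections import deque

    class ConfidenceMonitor:
        """Tracks per-word ASR confidence and flags when a CA re-link may be warranted."""

        def __init__(self, max_words=100, relink_threshold=0.90):
            self.scores = deque(maxlen=max_words)   # most recent word confidences
            self.relink_threshold = relink_threshold

        def add_word(self, confidence):
            self.scores.append(confidence)

        def should_relink_ca(self):
            if not self.scores:
                return False
            average = sum(self.scores) / len(self.scores)
            return average < self.relink_threshold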

In at least some cases when an AU selects a “help” button to re-link to a CA, partial call assistance may be provided instead of full CA service. For instance, instead of adding a CA that transcribes a HU's voice messages and then corrects errors, a CA may be linked only for correction purposes. The idea here is that while software trained to a HU's voice may generate some errors, the number of errors after training will still be relatively small in most cases even if objectionable to an AU. In at least some cases CAs may be trained to have different skill sets where highly skilled and relatively more expensive to retain CAs are trained to re-voice HU voice messages and correct the resulting text and less skilled CAs are trained to simply make corrections to automated text. Here, initially all calls may be routed to highly skilled revoicing or “transcribing” CAs and all re-linked calls may be routed to less skilled “corrector” CAs.

A sub-process 134 that may be added to the process of FIG. 3 for routing re-linked calls to a corrector CA is shown in FIG. 5. Referring also to FIGS. 1 and 3, at decision block 94, if an AU selects the help button, control may pass to block 136 in FIG. 5 where the call is linked to a second corrector CA. At block 138 the automated text is presented to the second CA via the CA's display 50. At block 140 the second CA listens to the voice of the HU and observes the automated text and makes corrections to errors perceived in the text. At block 142, server 30 transmits the corrected automated text to the AU's device for display via screen 18. After block 142 control passes back up to block 76 in FIG. 3.

Re-Sync and Fill in Text

In some cases where a CA generates text that drives an AU's display screen 18 (see again FIG. 1), for one reason or another the CA's transcription to text may fall behind the HU's voice message stream by a substantial amount. For instance, where a HU is speaking quickly, is using odd vocabulary, and/or has an unusual accent that is hard to understand, CA transcription may fall behind a voice message stream by 20 seconds, 40 seconds or more.

In many cases when captioning falls behind, an AU can perceive that presented text has fallen far behind broadcast voice messages from a HU based on memory of recently broadcast voice message content and observed text. For instance, an AU may recognize that currently displayed text corresponds to a portion of the broadcast voice message that occurred thirty seconds ago. In other cases some captioning delay indicator may be presented via an AU's device display 18. For instance, see FIG. 17 where captioning delay is indicated in two different ways on a display screen 18. First, text 212 indicates an estimated delay in seconds (e.g., 24 second delay). Second, at the end of already transcribed text 214, blanks 216 for words already voiced but yet to be transcribed may be presented to give an AU a sense of how delayed the captioning process has become.
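For illustration only, the following minimal Python sketch shows one way a delay indicator and word blanks like those in FIG. 17 might be derived from timestamps; the parameter names and data representation are assumptions made for this example:

    def delay_indicator(last_captioned_time, asr_words_pending, now):
        """last_captioned_time: timestamp of the last HU word the CA has captioned.
        asr_words_pending: count of words voiced after that point but not yet captioned.
        Returns the on-screen delay text and one blank per pending word."""
        delay_seconds = int(now - last_captioned_time)
        delay_text = f"{delay_seconds} second delay"                 # e.g., element 212
        blanks = " ".join("____" for _ in range(asr_words_pending))  # e.g., element 216
        return delay_text, blanks

    # Example: 24 seconds behind with 9 words voiced but not yet captioned
    text, blanks = delay_indicator(last_captioned_time=100.0, asr_words_pending=9, now=124.0)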

When an AU perceives that captioning is too far behind or when the user cannot understand a recently broadcast voice message, the AU may want the text captioning to skip ahead to the currently broadcast voice message. For instance, if an AU had difficulty hearing the most recent five seconds of a HU's voice message and continues to have difficulty hearing but generally understood the preceding 25 seconds, the AU may want the captioning process to be re-synced with the current HU's voice message so that the AU's understanding of current words is accurate.

Here, however, because the AU could not understand the most recent 5 seconds of broadcast voice message, a re-sync with the current voice message would leave the AU with at least some void in understanding the conversation (e.g., at least the most recent 5 seconds of misunderstood voice message would be lost). To deal with this issue, in at least some embodiments, it is contemplated that server 30 may run automated voice-to-text software on a HU's voice message simultaneously with a CA generating text from the voice message and, when an AU requests a “catch-up” or “re-sync” of the transcription process to the current voice message, server 30 may provide “fill in” automated text corresponding to the portion of the voice message between the most recent CA generated text and the instantaneous voice message, which may be provided to the AU's device for display and also, optionally, to the CA's display screen to maintain context for the CA. In this case, while the fill in automated text may have some errors, the fill in text will be better than no text for the associated period and can be referred to by the AU to better understand the voice messages.

In cases where the fill in text is presented on the CA's display screen, the CA may correct any errors in the fill in text. This correction, and any error correction by a CA for that matter, may be made prior to transmitting text to the AU's device or subsequent thereto. Where corrected text is transmitted to an AU's device subsequent to transmission of the original error prone text, the AU's device corrects the errors by replacing the erroneous text with the corrected text.

Because it is often the case that AUs will request a re-sync only when they have difficulty understanding words, server 30 may only present automated fill in text to an AU corresponding to a pre-defined duration period (e.g., 8 seconds) that precedes the time when the re-sync request occurs. For instance, consistent with the example above where CA captioning falls behind by thirty seconds, an AU may only request re-sync at the end of the most recent five seconds as inability to understand the voice message may only be an issue during those five seconds. By presenting the most recent eight seconds of automated text to the AU, the user will have the chance to read text corresponding to the misunderstood voice message without being inundated with a large segment of automated text to view. Where automated fill in text is provided to an AU for only a pre-defined duration period, the same text may be provided for correction to the CA.
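The following Python sketch, offered only as a hypothetical example, selects the pre-defined window of ASR fill-in text that precedes a re-sync request; the 8 second window is the example value above, and the (timestamp, word) representation is an assumption:

    FILL_IN_WINDOW_SECONDS = 8.0   # example pre-defined duration preceding the re-sync

    def fill_in_text(asr_words, last_ca_time, resync_time):
        """asr_words: list of (timestamp, word) pairs from the automated engine.
        Returns only ASR words that (a) fall after the last CA-captioned word and
        (b) fall within the window preceding the re-sync request."""
        window_start = max(last_ca_time, resync_time - FILL_IN_WINDOW_SECONDS)
        selected = [word for ts, word in asr_words if window_start <= ts <= resync_time]
        return " ".join(selected)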

Referring now to FIG. 7, a method 190 by which an AU requests a re-sync of the transcription process to current voice messages when CA generated text falls behind current voice messages is illustrated. Referring also to FIG. 1, at block 192 a HU's voice messages are received at relay 16. After block 192, control passes down to each of blocks 194 and 200 where two simultaneous sub-processes occur in parallel. At block 194, the HU's voice messages are stored in a rolling buffer. The rolling buffer may, for instance, have a two minute duration so that the most recent two minutes of a HU's voice messages are always stored. At block 196, a CA listens to the HU's voice message and transcribes text corresponding to the messages via re-voicing to software trained to the CA's voice, typing, etc. At block 198 the CA generated text is transmitted to AU's device 12 to be presented on display screen 18 after which control passes back up to block 192. Text correction may occur at block 196 or after block 198.
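A minimal sketch of the rolling voice buffer at block 194 is given below, assuming fixed-size audio frames and a hypothetical 20 ms frame duration; a real implementation would of course operate on the actual audio stream format used by the system:

    from collections import deque

    FRAME_SECONDS = 0.02            # assumed 20 ms audio frames
    BUFFER_SECONDS = 120            # rolling two-minute buffer

    class RollingVoiceBuffer:
        def __init__(self):
            max_frames = int(BUFFER_SECONDS / FRAME_SECONDS)
            self.frames = deque(maxlen=max_frames)   # oldest frames fall off automatically

        def add_frame(self, frame_bytes):
            self.frames.append(frame_bytes)

        def most_recent(self, seconds):
            """Return the most recent `seconds` of buffered audio as one byte string."""
            n = int(seconds / FRAME_SECONDS)
            return b"".join(list(self.frames)[-n:])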

Referring again to FIG. 7, at process block 200, the HU's voice is fed directly to voice-to-text software run by server 30 which generates automated text at block 202. Although not shown in FIG. 7, after block 202, server 30 may compare the automated text to the CA generated text to identify errors and may use those errors to train the software to the HU's voice so that the automated text continues to get more accurate as a call proceeds.

Referring still to FIGS. 1 and 7, at decision block 204, controller 30 monitors for a catch up or re-sync command received via the AU's device 12 (e.g., via selection of an on-screen virtual “catch up” button 220, see again FIG. 17). Where no catch up or re-sync command has been received, control passes back up to block 192 where the process described above continues to cycle. At block 204, once a re-sync command has been received, control passes to block 206 where the buffered voice messages are skipped and a current voice message is presented to the ear of the CA to be transcribed. At block 208 the automated text corresponding to the skipped voice message segment is filled in to the text on the CA's screen for context and at block 210 the fill in text is transmitted to the AU's device for display. Here, where an ASR engine continues to correct text errors within the filled in text, those errors may be automatically corrected on the AU device display, the CA station display or both.

In other embodiments, an AU device processor may monitor AU voice signals for AU control commands such as an “update” command and may use that command as a trigger to fill in delayed text with ASR text and to skip a CA ahead to a current HU voice signal. Here, the idea is that the AU device processor may be programmed to recognize one or a small number of verbal commands for controlling a captioning process. In at least some cases, when a verbal control command is received, the processor may filter that AU voice signal control command out of the signal transmitted to the HU device and may consume that command to skip the captioning process ahead.

Where automated text is filled in upon the occurrence of a catch up process, the fill in text may be visually distinguished on the AU's screen and/or on the CA's screen. For instance, fill in text may be highlighted, underlined, bolded, shown in a distinct font, etc. For example, see FIG. 18 that shows fill in text 222 that is underlined to visually distinguish it. See also that the captioning delay 212 has been updated. In some cases, fill in text corresponding to voice messages that occur after or within some pre-defined period prior to a re-sync request may be distinguished in yet a third way to point out the text corresponding to the portion of a voice message that the AU most likely found interesting (e.g., the portion that prompted selection of the re-sync button). For instance, where 24 previous seconds of text are filled in when a re-sync request is initiated, all 24 seconds of fill in text may be underlined and the 8 seconds of text prior to the re-sync request may also be highlighted in yellow. See in FIG. 18 that some of the fill in text is shown in a phantom box 226 to indicate highlighting.

In at least some cases it is contemplated that server 30 may be programmed to automatically determine when CA generated text substantially lags a current voice message from a HU and server 30 may automatically skip ahead to re-sync a CA with a current message while providing automated fill in text corresponding to intervening voice messages. For instance, server 30 may recognize when CA generated text is more than thirty seconds behind a current voice message and may skip the voice messages ahead to the current message while filling in automated text to fill the gap. In at least some cases this automated skip ahead process may only occur after at least some (e.g., 2 minutes of) training to a HU's voice to ensure that minimal errors are generated in the fill in text.

A method 150 for automatically skipping to a current voice message in a buffer when a CA falls too far behind is shown in FIG. 6. Referring also to FIG. 1, at block 152, a HU's voice messages are received at relay 16. After block 152, control passes down to each of blocks 154 and 162 where two simultaneous sub-processes occur in parallel. At block 154, the HU's voice messages are stored in a rolling buffer. At block 156, a CA listens to the HU's voice message and transcribes text corresponding to the messages via re-voicing to software trained to the CA's voice, typing, etc., after which control passes to block 170.

Referring still to FIG. 6, at process block 162, the HU's voice is fed directly to voice-to-text software run by server 30 which generates automated text at block 164. Although not shown in FIG. 6, after block 164, server 30 may compare the automated text to the CA generated text to identify errors and may use those errors to train the software to the HU's voice so that the automated text continues to get more accurate as a call proceeds.

Referring still to FIGS. 1 and 6, at decision block 166, controller 30 monitors how far CA text transcription is behind the current voice message and compares that value to a threshold value. If the delay is less than the threshold value, control passes down to block 170. If the delay exceeds the threshold value, control passes to block 168 where server 30 uses automated text from block 164 to fill in the CA generated text and skips the CA up to the current voice message. After block 168 control passes to block 170. At block 170, the text including the CA generated text and the fill in text is presented to the CA via display screen 50 and the CA makes any corrections to observed errors. At block 172, the text is transmitted to AU's device 12 and is displayed on screen 18. Again, uncorrected text may be transmitted to and displayed on device 12 and corrected text may be subsequently transmitted and used to correct errors in the prior text in line on device 12. After block 172 control passes back up to block 152 where the process described above continues to cycle. Automatically generated text used to fill in when skipping forward may be visually distinguished (e.g., highlighted, underlined, etc.).

In at least some cases when automated fill in text is generated, that text may not be presented to the CA or the AU as a single block and instead may be doled out at a higher speed than the talking speed of the HU until the text catches up with a current time. To this end, where transcription is far behind a current point in a conversation, if automated catch up text were generated as an immediate single block, in at least some cases, the earliest text in the block could shoot off a CA's display screen or an AU's display screen so that the CA or the AU would be unable to view all of the automated catch up text. Instead of presenting the automated text as a complete block upon catch up, the automated catch up text may be presented at a rate that is faster (e.g., two to three times faster) than the HU's rate of speaking so that catch up is rapid without the oldest catch up text running off the CA's or AU's displays.
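As a hypothetical sketch only, catch-up text might be paced at a multiple of the HU's speaking rate rather than dumped as a single block; the speedup value and the present_word callback below are assumptions made for this example:

    import time

    def present_catch_up_text(words, hu_words_per_minute, present_word, speedup=2.5):
        """Dole out catch-up words at `speedup` times the HU's speaking rate
        (e.g., two to three times faster) so old text does not run off the display."""
        seconds_per_word = 60.0 / (hu_words_per_minute * speedup)
        for word in words:
            present_word(word)            # e.g., append the word to the AU display
            time.sleep(seconds_per_word)

    # Example usage with a 150 words-per-minute talker and a simple print callback
    present_catch_up_text("this is the fill in text".split(), 150, print)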

In addition to avoiding a case where text shoots off an AU's display screen, presenting text in a constant but rapid flow has a better feel to it as the text is not presented in a jerky start and stop fashion which can be distracting to an AU trying to follow along as text is presented.

In other cases, when an AU requests fill in, the system may automatically fill in text and only present the most recent 10 seconds or so of the automatic fill in text to the CA for correction so that the AU has corrected text corresponding to a most recent period as quickly as possible. In many cases where the CA generated text is substantially delayed, much of the fill in text would run off a typical AU's device display screen when presented, so making corrections to that text would make little sense as the AU that requests catch up text is typically most interested in text associated with the most recent HU voice signal.

Many AU's devices can be used as conventional telephones without captioning service or as AU devices where captioning is presented and voice messages are broadcast to an AU. The idea here is that one device can be used by hearing impaired persons and persons that have no hearing impairment and that the overall costs associated with providing captioning service can be minimized by only using captioning when necessary. In many cases even a hearing impaired person may not need captioning service all of the time. For instance, a hearing impaired person may be able to hear the voice of a person that speaks loudly fairly well but may not be able to hear the voice of another person that speaks more softly. In this case, captioning would be required when speaking to the person with the soft voice but may not be required when speaking to the person with the loud voice. As another instance, an impaired person may hear better when well rested but hear relatively more poorly when tired so captioning is required only when the person is tired. As still another instance, an impaired person may hear well when there is minimal noise on a line but may hear poorly if line noise exceeds some threshold. Again, the impaired person would only need captioning some of the time.

To minimize captioning service costs and still enable an impaired person to obtain captioning service whenever needed, even during an ongoing call, some systems start out all calls with a default setting where an AU's device 12 is used like a normal telephone without captioning. At any time during an ongoing call, an AU can select either a mechanical or virtual “Caption” icon or button (see again 68 in FIG. 1) to link the call to a relay, provide a HU's voice messages to the relay and commence captioning service. One problem with starting captioning only after an AU experiences problems hearing words is that at least some words (e.g., words that prompted the AU to select the caption button in the first place) typically go unrecognized and therefore the AU is left with a void in their understanding of a conversation.

One solution to the problem of lost meaning when words are not understood just prior to selection of a caption button is to store a rolling recordation of a HU's voice messages that can be transcribed subsequently when the caption button is selected to generate “fill in” text. For instance, the most recent 20 seconds of a HU's voice messages may be recorded and then transcribed only if the caption button is selected. The relay generates text for the recorded message either automatically via software or via revoicing or typing by a CA or via a combination of both. In addition, the CA or the automated voice recognition software starts transcribing current voice messages. The text from the recording and the real time messages is transmitted to and presented via AU's device 12 which should enable the AU to determine the meaning of the previously misunderstood words. In at least some embodiments the rolling recordation of HU's voice messages may be maintained by the AU's device 12 (see again FIG. 1) and that recordation may be sent to the relay for immediate transcription upon selection of the caption button.

Referring now to FIG. 8, a process 230 that may be performed by the system of FIG. 1 to provide captioning for voice messages that occur prior to a request for captioning service is illustrated. Referring also to FIG. 1, at block 232 a HU's voice messages are received during a call with an AU at the AU's device 12. At block 234 the AU's device 12 stores a most recent 20 seconds of the HU's voice messages on a rolling basis. The 20 seconds of voice messages are stored without captioning initially in at least some embodiments. At decision block 236, the AU's device monitors for selection of a captioning button (not shown). If the captioning button has not been selected, control passes back up to block 232 where blocks 232, 234 and 236 continue to cycle.

Once the caption button has been selected, control passes to block 238 where AU's device 12 establishes a communication link to relay 16. At block 240 AU's device 12 transmits the stored 20 seconds of the HU's voice messages along with current ongoing voice messages from the HU to relay 16. At this point a CA and/or software at the relay transcribes the voice to text, corrections are made (or not), and the text is transmitted back to device 12 to be displayed. At block 242 AU's device 12 receives the captioned text from the relay 16 and at block 244 the received text is displayed or presented on the AU's device display 18. At block 246, in at least some embodiments, text corresponding to the 20 seconds of HU voice messages prior to selection of the caption button may be visually distinguished (e.g., highlighted, bolded, underlined, etc.) from other text in some fashion. After block 246 control passes back up to block 232 where the process described above continues to cycle and captioning in substantially real time continues.
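The following Python sketch, a hypothetical illustration only, shows the general shape of the FIG. 8 handling on the AU device: keep a rolling 20 second recording, and on a caption request send the recording followed by the live stream to the relay. The send_to_relay and live_audio_frames names are placeholders, and the voice_buffer is assumed to behave like the RollingVoiceBuffer sketched earlier.

    PRE_REQUEST_SECONDS = 20   # example rolling recordation length

    def on_caption_button(voice_buffer, live_audio_frames, send_to_relay):
        """voice_buffer: a RollingVoiceBuffer-like object holding recent HU audio.
        Sends the stored pre-request audio first (block 240), marked so that the
        resulting text can be visually distinguished (block 246), then streams
        the ongoing call audio."""
        stored_audio = voice_buffer.most_recent(PRE_REQUEST_SECONDS)
        send_to_relay(stored_audio, pre_request=True)
        for frame in live_audio_frames:          # ongoing HU voice messages
            send_to_relay(frame, pre_request=False)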

Referring to FIG. 9, a relay server process 270 whereby automated software transcribes voice messages that occur prior to selection of a caption button and a CA at least initially captions current voice messages is illustrated. At block 272, after an AU requests captioning service by selecting a caption button, server 30 receives a HU's voice messages including current ongoing messages as well as the most recent 20 seconds of voice messages that had been stored by AU's device 12 (see again FIG. 1). After block 272, control passes to each of blocks 274 and 278 where two simultaneous processes commence in parallel. At block 274 the stored 20 seconds of voice messages are provided to voice-to-text software run by server 30 to generate automated text and at block 276 the automated text is transmitted to the AU's device 12 for display. At block 278 the current or real time HU's voice messages are provided to a CA and at block 280 the CA transcribes the current voice messages to text. The CA generated text is transmitted to an AU's device at block 282 where the text is displayed along with the text transmitted at block 276. Thus, here, the AU receives text corresponding to misunderstood voice messages that occur just prior to the AU requesting captioning. One other advantage of this system is that when captioning starts, the CA is not starting captioning with an already existing backlog of words to transcribe and instead automated software is used to provide the prior text.

In other embodiments, when an AU cannot understand a voice message during a normal call and selects a caption button to obtain captioning for a most recent segment of a HU's voice signal, the system may simply provide captions for the most recent 10-20 seconds of the voice signal without initiating ongoing automated captioning or assistance from a CA. Thus, where an AU is only sporadically or periodically unable to hear and understand the broadcast HU's voice, the AU may select the caption button to obtain periodic captioning when needed. For instance, it is envisioned that in one case, an AU may participate in a five minute call and may only require captioning during three short 20 second periods. In this case, the AU would select the caption button three times, once for each time that the user is unable to hear the HU's voice signal, and the system would generate three bursts of text, one for each of three HU voice segments just prior to each of the button activation events.

In some cases instead of just presenting captioning for the 20 seconds prior to a caption button activation event, the system may present the prior 20 seconds and a few seconds (e.g., 10) of captioning just after the button selection to provide the 20 prior seconds in some context to make it easier for the AU to understand the overall text.

Third Party Automated Speech Recognition (ASR) and Other ASR Resources

In addition to using a service provided by relay 16 to transcribe the stored rolling voice messages, other resources may be used to transcribe them. For instance, in at least some embodiments an AU's device may link via the Internet or the like to a third party provider running automated speech recognition (ASR) software that can receive voice messages and transcribe those messages, at least somewhat accurately, to text. In these cases it is contemplated that real time transcription where accuracy needs to meet a high accuracy standard would still be performed by a CA or software trained to a specific voice while less accuracy sensitive text may be generated by the third party provider, at least some of the time for free or for a nominal fee, and transmitted back to the AU's device for display.

In other cases, it is contemplated that the AU's device 12 itself may run voice-to-text or ASR software to at least somewhat accurately transcribe voice messages to text where the text generated by the AU's device would only be provided in cases where accuracy sensitivity is less than normal such as where rolling voice messages prior to selection of a caption icon to initiate captioning are to be transcribed.

FIG. 10 shows another method 300 for providing text for voice messages that occurred prior to a caption request, albeit where an AU's device generates the pre-request text as opposed to a relay. Referring also to FIG. 1, at block 310 a HU's voice messages are received at an AU's device 12. At block 312, the AU's device 12 runs voice-to-text software that, in at least some embodiments, trains on the fly to the voice of a linked HU and generates caption text.

Here, on the fly training may include assigning a confidence factor to each automatically transcribed word and only using text that has a high confidence factor to train a voice model for the HU. For instance, only text having a confidence factor greater than 95% may be used for automatic training purposes. Here, confidence factors may be assigned based on many different factors or algorithms, many of which are well known in the automatic voice recognition art. In this embodiment, at least initially, the caption text generated by the AU's device 12 is not displayed to the AU in at least some embodiments. At block 314, until the AU requests captioning, control simply routes back up to block 310. Once captioning is requested by an AU, control passes to block 316 where the text corresponding to the last 20 seconds generated by the AU's device is presented on the AU's device display 18. Here, while there may be some errors in the displayed text, at least some text associated with the most recent voice message can be quickly presented and give the AU the opportunity to attempt to understand the voice messages associated therewith. At block 318 the AU's device links to a relay and at block 320 the HU's ongoing voice messages are transmitted to the relay. At block 322, after CA transcription at the relay, the AU's device receives the transcribed text from the relay and at block 324 the text is displayed. After block 324 control passes back up to block 320 where the sub-loop including blocks 320, 322 and 324 continues to cycle.
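A hypothetical sketch of confidence-gated on-the-fly training as described above follows; the 95% cutoff is taken from the example, while the voice_model interface (add_training_pair) is an assumption for illustration:

    TRAINING_CONFIDENCE_CUTOFF = 0.95   # only high-confidence words train the model

    def train_on_the_fly(asr_results, voice_model):
        """asr_results: iterable of (word, audio_segment, confidence) tuples from the
        device ASR engine. Only words the engine is highly confident about are fed
        back into the HU-specific voice model."""
        for word, audio_segment, confidence in asr_results:
            if confidence > TRAINING_CONFIDENCE_CUTOFF:
                voice_model.add_training_pair(audio_segment, word)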

Thus, in the above example, instead of the AU's device storing the last 20 seconds of a HU's voice signal and transcribing that voice signal to text after the AU requests transcription, the AU's device constantly runs an ASR engine behind the scenes to generate automated engine text which is stored without initially being presented to the AU. Then, when the AU requests captioning or transcription, the most recently transcribed text can be presented via the AU's device display immediately or via rapid presentation (e.g., sequentially at a speed higher than the HU's speaking speed).

In at least some cases it is contemplated that voice-to-text software run outside control of the relay may be used to generate at least initial text for a HU's voice and that the initial text may be presented via an AU's device. Here, because known software still may generate more text transcription errors than allowed given standard accuracy requirements in the text captioning industry, a relay correction service may be provided. For instance, in addition to presenting text transcribed by the AU's device via a device display 18, the text transcribed by the AU's device may also be transmitted to a relay 16 for correction. In addition to transmitting the text to the relay, the HU's voice messages may also be transmitted to the relay so that a CA can compare the text automatically generated by the AU's device to the HU's voice messages. At the relay, the CA can listen to the voice of the hearing person and can observe associated text. Any errors in the text can be corrected and corrected text blocks can be transmitted back to the AU's device and used for in line correction on the AU's display screen.

One advantage to this type of system is that relatively less skilled CAs may be retained at a lesser cost to perform the CA tasks. A related advantage is that the stress level on CAs may be reduced appreciably by eliminating the need to both transcribe and correct at high speeds and therefore CA turnover at relays may be appreciably reduced which ultimately reduces costs associated with providing relay services.

A similar system may include an AU's device that links to some other third party provider ASR transcription/caption server (e.g., in the “cloud”) to obtain initial captioned text which is immediately displayed to an AU and which is also transmitted to the relay for CA correction. Here, again, the CA corrections may be used by the third party provider to train the software on the fly to the HU's voice. In this case, the AU's device may have three separate links, one to the HU, a second link to a third party provider server, and a third link to the relay. In other cases, the relay may create the link to the third party server for ASR services. Here, the relay would provide the HU's voice signal to the third party server, would receive text back from the server to transmit to the AU device and would receive corrections from the CA to transmit to each of the AU device and the third party server. The third party server would then use the corrections to train the voice model to the HU voice and would use the evolving model to continue ASR transcription. In still other cases the third party ASR may train on an HU's voice signal based on confidence factors and other training algorithms, completely independently of CA corrections.

Referring to FIG. 11, a method 360 whereby an AU's device transcribes a HU's voice to text and where corrections are made to the text at a relay is illustrated. At block 362 a HU's voice messages are received at an AU's device 12 (see also again FIG. 1). At block 364 the AU's device runs voice-to-text software to generate text from the received voice messages and at block 366 the generated text is presented to the AU via display 18. At block 370 the transcribed text is transmitted to the relay 16 and at block 372 the text is presented to a CA via the CA's display 50. At block 374 the CA corrects the text and at block 376 corrected blocks of text are transmitted to the AU's device 12. At block 378 the AU's device 12 uses the corrected blocks to correct the text errors via in line correction. At block 380, the AU's device uses the errors, the corrected text and the voice messages to train the captioning software to the HU's voice.

In some cases instead of having a relay or an AU's device run automated voice-to-text transcription software, a HU's device may include a processor that runs transcription software to generate text corresponding to the HU's voice messages. To this end, device 14 may, instead of including a simple telephone, include a computer that can run various applications including a voice-to-text program or may link to some third party real time transcription software program (e.g., software run on a third party server in the “cloud” (e.g., Watson, Google Voice, etc.)) to obtain an initial text transcription substantially in real time. Here, as in the case where an AU's device runs the transcription software, the text will often have more errors than allowed by the standard accuracy requirements.

Again, to correct the errors, the text and the HU's voice messages are transmitted to relay 16 where a CA listens to the voice messages, observes the text on screen 50 and makes corrections to eliminate transcription errors. The corrected blocks of text are transmitted to the AU's device for display. The corrected blocks may also be transmitted back to the HU's device for training the captioning software to the HU's voice. In these cases the text transcribed by the HU's device and the HU's voice messages may either be transmitted directly from the HU's device to the relay or may be transmitted to the AU's device 12 and then on to the relay. Where the HU's voice messages and text are transmitted directly to the relay 16, the voice messages and text may also be transmitted directly to the AU's device for immediate broadcast and display and the corrected text blocks may be subsequently used for in line correction.

In these cases the caption request option may be supported so that an AU can initiate captioning during an on-going call at any time by simply transmitting a signal to the HU's device instructing the HU's device to start the captioning process. Similarly, in these cases the help request option may be supported. Where the help option is facilitated, the automated text may be presented via the AU's device and, if the AU perceives that too many text errors are being generated, the help button may be selected to cause the HU's device or the AU's device to transmit the automated text to the relay for CA correction.

One advantage to having a HU's device manage or perform voice-to-text transcription is that the voice signal being transcribed can be a relatively high quality voice signal. To this end, a standard phone voice signal has a range of frequencies between 300 and about 3000 Hertz, which is only a fraction of the frequency range used by most voice-to-text transcription programs and therefore, in many cases, automated transcription software does only a poor job of transcribing voice signals that have passed through a telephone connection. Where transcription can occur within a digital signal portion of an overall system, the frequency range of voice messages can be optimized for automated transcription. Thus, where a HU's computer that is all digital receives and transcribes voice messages, the frequency range of the messages is relatively large and accuracy can be increased appreciably. Similarly, where a HU's computer can send digital voice messages to a third party transcription server, accuracy can be increased appreciably.

Calls of Different Sound Quality Handled Differently

In at least some configurations it is contemplated that the link between an AU's device 12 and a HU's device 14 may be either a standard phone type connection or may be a digital or high definition (HD) connection depending on the capabilities of the HU's device that links to the AU's device. Thus, for instance, a first call may be standard quality and a second call may be high definition audio. Because high definition voice messages have a greater frequency range and therefore can be automatically transcribed more accurately than standard definition audio voice messages in many cases, it has been recognized that a system where automated voice-to-text program use is implemented on a case by case basis depending upon the type of voice message received (e.g., digital or analog) would be advantageous. For instance, in at least some embodiments, where a relay receives a standard definition voice message for transcription, the relay may automatically link to a CA for full CA transcription service where the CA transcribes and corrects text via revoicing and keyboard manipulation, and where the relay receives a high definition digital voice message for transcription, the relay may run an automated voice-to-text transcription program to generate automated text. The automated text may either be immediately corrected by a CA or may only be corrected by an assistant after a help feature is selected by an AU as described above.

Referring to FIG. 12, one process 400 for treating high definition digital messages differently than standard definition voice messages is illustrated. Referring also to FIG. 1, at block 402 a HU's voice messages are received at a relay 16. At decision block 404, relay server 30 determines if the received voice message is a high definition digital message or is a standard definition message (e.g., sometimes an analog message). Where a high definition message has been received, control passes to block 406 where server 30 runs an automated voice-to-text program on the voice messages to generate automated text. At block 408 the automated text is transmitted to the AU's device 12 for display. Referring again to block 404, where the HU's voice messages are in standard definition audio, control passes to block 412 where a link to a CA is established so that the HU's voice messages are provided to a CA. At block 414 the CA listens to the voice messages and transcribes the messages into text. Error correction may also be performed at block 414. After block 414, control passes to block 408 where the CA generated text is transmitted to the AU's device 12. Again, in some cases, when automated text is presented to an AU, a help button may be presented that, when selected, causes automated text to be presented to a CA for correction. In other cases automated text may be automatically presented to a CA for correction.
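A hypothetical minimal sketch of the FIG. 12 routing decision follows; is_high_definition, run_asr, and route_to_ca stand in for system components not detailed here:

    def route_incoming_call(voice_stream, is_high_definition, run_asr, route_to_ca):
        """Block 404: choose automated captioning for high definition (wideband)
        audio and full CA service for standard definition audio."""
        if is_high_definition(voice_stream):
            return run_asr(voice_stream)       # blocks 406/408: automated text to the AU
        return route_to_ca(voice_stream)       # blocks 412/414: CA transcription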

Another system is contemplated where all incoming calls to a relay are initially assigned to a CA for at least initial captioning where the option to switch to automated software generated text is only available when the call includes high definition audio and after accuracy standards have been exceeded. Here, all standard definition HU voice messages would be captioned by a CA from start to finish and any high definition calls would cut out the CA when the standard is exceeded.

In at least some cases where an AU's device is capable of running automated voice-to-text transcription software, the AU's device 12 may be programmed to select either automated transcription when a high definition digital voice message is received or a relay with a CA when a standard definition voice message is received. Again, where device 12 runs an automated text program, CA correction may be automatic or may only start when a help button is selected.

FIG. 13 shows a process 430 whereby an AU's device 12 selects either automated voice-to-text software or a CA to transcribe based on the type (e.g., digital or analog) of voice messages received. At block 432 a HU's voice messages are received by an AU's device 12. At decision block 434, a processor in device 12 determines if the AU has selected a help button. Initially no help button is selected as no text has been presented, so at least initially control passes to block 436. At decision block 436, the device processor determines if a HU's voice signal that is received is high definition digital or is standard definition. Where the received signal is high definition digital, control passes to block 438 where the AU's device processor runs automated voice-to-text software to generate automated text which is then displayed on the AU device display 18 at block 440.

Referring still to FIG. 13, if the help button has been selected at block 434 or if the received voice messages are in standard definition, control passes to block 442 where a link to a CA at relay 16 is established and the HU's voice messages are transmitted to the relay. At block 444 the CA listens to the voice messages and generates text and at block 446 the text is transmitted to the AU's device 12 where the text is displayed at block 440.

HU Recognition and Voice Training

It has been recognized that in many cases most calls facilitated using an AU's device will be with a small group of other hearing or non-hearing users. For instance, in many cases as much as 70 to 80 percent of all calls to an AU's device will be with one of five or fewer HU's devices (e.g., family, close friends, a primary care physician, etc.). For this reason it has been recognized that it would be useful to store voice-to-text models for at least routine callers that link to an AU's device so that the automated voice-to-text training process can either be eliminated or substantially expedited. For instance, when an AU initiates a captioning service, if a previously developed voice model for a HU can be identified quickly, that model can be used without a new training process and the switchover from a full service CA to automated captioning may be expedited (e.g., instead of taking a minute or more, the switchover may be accomplished in 15 seconds or less, in the time required to recognize or distinguish the HU's voice from other voices).

FIG. 14 shows a sub-process 460 that may be substituted for a portion of the process shown in FIG. 3 wherein voice-to-text templates or models along with related voice recognition profiles for callers are stored and used to expedite the handoff to automated transcription. Prior to running sub-process 460, referring again to FIG. 1, server 30 is used to create a voice recognition database for storing HU device identifiers along with associated voice recognition profiles and associated voice-to-text models. A voice recognition profile is a data construct that can be used to distinguish one voice from others and provide improved speech to text accuracy.

In the context of the FIG. 1 system, voice recognition profiles are useful because more than one person may use a HU's device to call an AU. For instance, in an exemplary case, an AU's son or daughter-in-law or one of any of three grandchildren may routinely use device 14 to call an AU and therefore, to access the correct voice-to-text model, server 30 needs to distinguish which caller's voice is being received. Thus, in many cases, the voice recognition database will include several voice recognition profiles for each HU device identifier (e.g., each HU phone number). A voice-to-text model includes parameters that are used to customize voice-to-text software for transcribing the voice of an associated HU to text.

The voice recognition database will include at least one voice model for each voice profile to be used by server 30 to automate transcription whenever a voice associated with the specific profile is identified. Data in the voice recognition database will be generated on the fly as an AU uses device 12. Thus, initially the voice recognition database will include a simple construct with no device identifiers, profiles or voice models.
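A hypothetical sketch of one possible shape for the voice recognition database is given below, keyed by HU device identifier with multiple voice profiles per device; the class and field names are assumptions made only for illustration:

    from dataclasses import dataclass, field

    @dataclass
    class CallerEntry:
        voice_profile: dict          # features used to recognize this caller's voice
        voice_model: dict            # parameters tuning the ASR engine to this voice

    @dataclass
    class VoiceRecognitionDatabase:
        # Maps a HU device identifier (e.g., phone number) to the callers who use it.
        devices: dict = field(default_factory=dict)   # identifier -> list[CallerEntry]

        def lookup(self, device_id):
            """Return stored caller entries for a device, or an empty list the first
            time the device connects (no profiles or models yet)."""
            return self.devices.get(device_id, [])

        def store(self, device_id, entry):
            self.devices.setdefault(device_id, []).append(entry)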

Referring still to FIGS. 1 and 14 and now also to FIG. 3, at decision block 84 in FIG. 3, if the help flag is still zero (e.g., an AU has not requested CA help to correct automated text errors) control may pass to block 464 in FIG. 14 where the HU's device identifier (e.g., a phone number, an IP address, a serial number of a HU's device, etc.) is received by server 30. At block 468 server 30 determines if the HU's device identifier has already been added to the voice recognition database. If the HU's device identifier does not appear in the database (e.g., the first time the HU's device is used to connect to the AU's device) control passes to block 482 where server 30 uses a general voice-to-text program to convert the HU's voice messages to text after which control passes to block 476. At block 476 the server 30 trains a voice-to-text model using transcription errors. Again, the training will include comparing CA generated text to automated text to identify errors and using the errors to adjust model parameters so that the next time a word associated with an error is uttered by the HU, the software will identify the correct word. At block 478, server 30 trains a voice profile for the HU's voice so that the next time the HU calls, a voice profile will exist for the specific HU that can be used to identify the HU. At block 480 the server 30 stores the voice profile and voice model for the HU along with the HU device identifier for future use after which control passes back up to block 94 in FIG. 3.

Referring still to FIGS. 1 and 14, at block 468, if the HU's device is already represented in the voice recognition database, control passes to block 470 where server 30 runs voice recognition software on the HU's voice messages in an attempt to identify a voice profile associated with the specific HU. At decision block 472, if the HU's voice does not match one of the previously stored voice profiles associated with the device identifier, control passes to block 482 where the process described above continues. At block 472, if the HU's voice matches a previously stored profile, control passes to block 474 where the voice model associated with the matching profile is used to tune the voice-to-text software to be used to generate automated text.

Referring still to FIG. 14, at blocks 476 and 478, the voice model and voice profile for the HU are continually trained. Continual training enables the system to constantly adjust the model for changes in a HU's voice that may occur over time or when the HU experiences some physical condition (e.g., a cold, a raspy voice) that affects the sound of their voice. At block 480, the voice profile and voice model are stored with the HU device identifier for future use.

In at least some embodiments, server 30 may adaptively change the order of voice profiles applied to a HU's voice during the voice recognition process. For instance, while server 30 may store five different voice profiles for five different HUs that routinely connect to an AU's device, a first of the profiles may be used 80 percent of the time. In this case, when captioning is commenced, server 30 may start by using the first profile to analyze a HU's voice at block 472 and may cycle through the profiles from the most matched to the least matched.

To avoid server 30 having to store a different voice profile and voice model for every hearing person that communicates with an AU via device 12, in at least some embodiments it is contemplated that server 30 may only store models and profiles for a limited number (e.g., 5) of frequent callers. To this end, in at least some cases server 30 will track calls and automatically identify the most frequent HU devices used to link to the AU's device 12 over some rolling period (e.g., 1 month) and may only store models and profiles for the most frequent callers. Here, a separate counter may be maintained for each HU device used to link to the AU's device over the rolling period and different models and profiles may be swapped in and out of the stored set based on frequency of calls.
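As a hypothetical sketch only, the frequency-based retention described above might keep per-device call counters over a rolling period and retain models only for the top callers; the one month window and the limit of five are the example values from the text, while the class structure is an assumption:

    import time
    from collections import defaultdict

    ROLLING_PERIOD_SECONDS = 30 * 24 * 3600   # example: one month
    MAX_STORED_CALLERS = 5                     # example: five frequent callers

    class FrequentCallerTracker:
        def __init__(self):
            self.call_times = defaultdict(list)   # device_id -> list of call timestamps

        def record_call(self, device_id, now=None):
            self.call_times[device_id].append(now or time.time())

        def devices_to_retain(self, now=None):
            """Return the device identifiers whose models/profiles should be kept,
            ranked by call frequency over the rolling period."""
            now = now or time.time()
            counts = {
                dev: sum(1 for t in times if now - t <= ROLLING_PERIOD_SECONDS)
                for dev, times in self.call_times.items()
            }
            ranked = sorted(counts, key=counts.get, reverse=True)
            return ranked[:MAX_STORED_CALLERS]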

In other embodiments server 30 may query an AU for some indication that a specific HU is or will be a frequent contact and may add that person to a list for which a model and a profile should be stored, for a total of up to five persons.

While the system described above with respect to FIG. 14 assumes that the relay 16 stores and uses voice models and voice profiles that are trained to HUs' voices for subsequent use, in at least some embodiments it is contemplated that an AU's device 12 processor may maintain and use, or at least have access to and use, the voice recognition database to generate automated text without linking to a relay. In this case, because the AU's device runs the software to generate the automated text, the software for generating text can be trained any time the user's device receives a HU's voice messages without linking to a relay. For example, during a call between a HU and an AU on devices 14 and 12, respectively, in FIG. 1, and prior to an AU requesting captioning service, the voice messages of even a new HU can be used by the AU's device to train a voice-to-text model and a voice profile for the user. In addition, prior to a caption request, as the model is trained and gets better and better, the model can be used to generate text that can be used as fill in text (e.g., text corresponding to voice messages that precede initiation of the captioning function) when captioning is selected.

FIG. 15 shows a process 500 that may be performed by an AU's device to train voice models and voice profiles and use those models and profiles to automate text transcription until a help button is selected. Referring also to FIG. 1, at block 502, an AU's device 12 processor receives a HU's voice messages as well as an identifier (e.g., a phone number) of the HU's device 14. At block 504 the processor determines if the AU has selected the help button (e.g., indicating that current captioning includes too many errors). If an AU selects the help button at block 504, control passes to block 522 where the AU's device is linked to a CA at relay 16 and the HU's voice is presented to the CA. At block 524 the AU's device receives text back from the relay and at block 534 the CA generated text is displayed on the AU's device display 18.

Where the help button has not been selected, control passes to block 505 where the processor uses the device identifier to determine if the HU's device is represented in the voice recognition database. Where the HU's device is not represented in the database, control passes to block 528 where the processor uses a general voice-to-text program to convert the HU's voice messages to text after which control passes to block 512.

Referring again to FIGS. 1 and 15, at block 512 the processor adaptively trains the voice model using perceived errors in the automated text. To this end, one way to train the voice model is to generate text phonetically and thereafter perform a context analysis of each text word by looking at other words proximate the word to identify errors. Another example of using context to identify errors is to look at several generated text words as a phrase and compare the phrase to similar prior phrases that are consistent with how the specific HU strings words together and identify any discrepancies as possible errors. At block 514 a voice profile for the HU is generated from the HU's voice messages so that the HU's voice can be recognized in the future. At block 516 the voice model and voice profile for the HU are stored for future use during subsequent calls and then control passes to block 518 where the process described above continues. Thus, blocks 528, 512, 514 and 516 enable the AU's device to train voice models and voice profiles for HUs that call in anew, where a new voice model can be used during an ongoing call and during future calls to provide generally accurate transcription.

Referring still to FIGS. 1 and 15, if the HU's device is already represented in the voice recognition database at block 505, control passes to block 506 where the processor runs voice recognition software on the HU's voice messages in an attempt to identify one of the voice profiles associated with the device identifier. At block 508, where no voice profile is recognized, control passes to block 528.

At block 508, if the HU's voice matches one of the stored voice profiles, control passes to block 510 where the voice-to-text model associated with the matching profile is used to generate automated text from the HU's voice messages. Next, at block 518, the AU's device processor determines if the caption button on the AU's device has been selected. If captioning has not been selected, control passes to block 502 where the process continues to cycle. Once captioning has been requested, control passes to block 520 where AU's device 12 displays the most recent 10 seconds of automated text and continuing automated text on display 18.

In at least some embodiments it is contemplated that different types of voice model training may be performed by different processors within the overall FIG. 1 system. For instance, while an AU's device is not linked to a relay, the AU's device cannot use any errors identified by a call assistant at the relay to train a voice model as no CA is generating errors. Nevertheless, the AU's device can use context and confidence factors to identify errors and train a model. Once an AU's device is linked to a relay where a CA corrects errors, the relay server can use the CA identified errors and corrections to train a voice model which can, once sufficiently accurate, be transmitted to the AU's device where the new model is substituted for the old content based model or where the two models are combined into a single robust model in some fashion. In other cases when an AU's device links to a relay for CA captioning, a context based voice model generated by the AU's device for the HU may be transmitted to the relay server and used as an initial model to be further trained using CA identified errors and corrections. In still other cases CA errors may be provided to the AU's device and used by that device to further train a context based voice model for the HU.

Referring now to FIG. 16, a sub-process 550 that may be added to the process shown in FIG. 15, whereby an AU's device trains a voice model for a HU using voice message content and a relay server further trains the voice model generated by the AU's device using CA identified errors, is illustrated. Referring also to FIG. 15, sub-process 550 is intended to be performed in parallel with blocks 524 and 534 in FIG. 15. Thus, after block 522, in addition to block 524, control also passes to block 552 in FIG. 16. At block 552 the voice model for a HU that has been generated by an AU's device 12 is transmitted to relay 16 and at block 553 the voice model is used to modify a voice-to-text program at the relay. At block 554 the modified voice-to-text program is used to convert the HU's voice messages to automated text. At block 556 the CA generated text is compared to the automated text to identify errors. At block 558 the errors are used to further train the voice model. At block 560, if the voice model has an accuracy below the required standard, control passes back to block 502 in FIG. 15 where the process described above continues to cycle. At block 560, once the accuracy exceeds the standard requirement, control passes to block 562 wherein server 30 transmits the trained voice model to the AU's device for handling subsequent calls from the HU for which the model was trained. At block 564 the new model is stored in the database maintained by the AU's device.
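
For illustration only, the following sketch outlines the relay-side loop of blocks 554 through 562 in one possible form; the helper callables transcribe and train_on_errors, the word-level accuracy measure, and the 0.95 standard are assumptions and are not part of the disclosure.

```python
REQUIRED_ACCURACY = 0.95  # assumed standard; the actual requirement is set by the relay

def word_accuracy(auto_words, ca_words):
    """Fraction of aligned positions where the ASR word matches the CA word."""
    pairs = list(zip(auto_words, ca_words))
    if not pairs:
        return 0.0
    return sum(a == c for a, c in pairs) / len(pairs)

def refine_model(voice_model, hu_audio_segments, ca_text_segments, transcribe, train_on_errors):
    """Train the AU-device model with CA corrections until it meets the standard."""
    for audio, ca_text in zip(hu_audio_segments, ca_text_segments):
        auto_text = transcribe(voice_model, audio)                      # block 554
        errors = [(a, c) for a, c in zip(auto_text.split(), ca_text.split()) if a != c]  # block 556
        voice_model = train_on_errors(voice_model, errors)              # block 558
        if word_accuracy(auto_text.split(), ca_text.split()) >= REQUIRED_ACCURACY:
            return voice_model, True    # block 562: transmit trained model to the AU's device
    return voice_model, False           # accuracy still below standard; keep cycling
```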

Referring still to FIG. 16, in addition to transmitting the trained model to the AU's device at block 562, once the model is accurate enough to meet the standard requirements, server 30 may perform an automated process to cut out the CA and instead transmit automated text to the AU's device as described above in FIG. 1. In the alternative, once the model has been transmitted to the AU's device at block 562, the relay may be programmed to hand off control to the AU's device which would then use the newly trained and relatively more accurate model to perform automated transcription so that the relay could be disconnected.

Several different concepts and aspects of the present disclosure havebeen described above. It should be understood that many of the conceptsand aspects may be combined in different ways to configure other triagesystems that are more complex. For instance, one exemplary system mayinclude an AU's device that attempts automated captioning with on thefly training first and, when automated captioning by the AU's devicefails (e.g., a help icon is selected by an AU), the AU's device may linkto a third party captioning system via the internet or the like whereanother more sophisticated voice-to-text captioning software is appliedto generate automated text. Here, if the help button is selected asecond time or a “CA” button is selected, the AU's device may link to aCA at the relay for CA captioning with simultaneous voice-to-textsoftware transcription where errors in the automated text are used totrain the software until a threshold accuracy requirement is met. Here,once the accuracy requirement is exceeded, the system may automaticallycut out the CA and switch to the automated text from the relay until thehelp button is again selected. In each of the transcription hand offs,any learning or model training performed by one of the processors in thesystem may be provided to the next processor in the system to be used toexpedite the training process.

Line Check Words

In at least some embodiments an automated voice-to-text engine may beutilized in other ways to further enhance calls handled by a relay. Forinstance, in cases where transcription by a CA lags behind a HU's voicemessages, automated transcription software may be programmed totranscribe text all the time and identify specific words in a HU's voicemessages to be presented via an AU's display immediately when identifiedto help the AU determine when a HU is confused by a communication delay.For instance, assume that transcription by a CA lags a HU's most currentvoice message by 20 seconds and that an AU is relying on the CAgenerated text to communicate with the HU. In this case, because the CAgenerated text lag is substantial, the HU may be confused when the AU'sresponse also lags a similar period and may generate a voice messagequestioning the status of the call. For instance, the HU may utter “Areyou there?” or “Did you hear me?” or “Hello” or “What did you say?”.These phrases and others like them querying call status are referred toherein as “line check words” (LCWs) as the HU is checking the status ofthe call on the line.

If the line check words were not presented until they occurred sequentially in the HU's voice messages, they would be delayed for 20 or more seconds in the above example. In at least some embodiments it is contemplated that the automated voice engine may search for line check words (e.g., 50 common line check phrases) in a HU's voice messages and present the line check words immediately via the AU's device during a call regardless of which words have been transcribed and presented to an AU. The AU, seeing line check words or a phrase, can verbally respond that the captioning service is lagging but catching up so that the parties can avoid or at least minimize confusion. In the alternative, a system processor may automatically respond to any line check words by broadcasting a voice message to the HU indicating that transcription is lagging and will catch up shortly. The automated message may also be broadcast to the AU so that the AU is also aware of the HU's situation.
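
By way of a non-limiting sketch, line check word spotting can be as simple as scanning newly generated ASR text against a short phrase list; the phrase list below and the present_immediately callback are hypothetical examples.

```python
LINE_CHECK_PHRASES = [
    "are you there", "are you still there", "did you hear me",
    "hello", "what did you say",
]

def spot_line_check_words(asr_text, present_immediately):
    """Scan newly generated ASR text and push any line check phrase to the AU display."""
    normalized = " ".join(asr_text.lower().split())
    for phrase in LINE_CHECK_PHRASES:
        if phrase in normalized:
            present_immediately(phrase)   # shown ahead of the lagging CA generated text
            return phrase
    return None

# Example: prints the phrase as soon as it is found in the ASR output.
spot_line_check_words("ok so Are you still there", print)
```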

When line check words are presented to an AU the words may be presentedin-line within text being generated by a CA with intermediate blanksrepresenting words yet to be transcribed by the CA. To this end, seeagain FIG. 17 that shows line check words “Are you still there?” in ahighlighting box 590 at the end of intermediate blanks 216 representingwords yet to be transcribed by the CA. Line check words will, in atleast some embodiments, be highlighted on the display or otherwisevisually distinguished. In other embodiments the line check words may belocated at some prominent location on the AU's display screen (e.g., ina line check box or field at the top or bottom of the display screen).

One advantage of using an automated voice engine to only search forspecific words and phrases is that the engine can be tuned for thosewords and will be relatively more accurate than a general purpose enginethat transcribes all words uttered by a HU. In at least some embodimentsthe automated voice engine will be run by an AU's device processor whilein other embodiments the automated voice engine may be run by the relayserver with the line check words transmitted to the AU's deviceimmediately upon generation and identification.

In still other cases where automated text is presented immediately upon generation to an AU, line check words may be presented in a visually distinguished fashion (e.g., highlighted, in a different color, as a distinct font, as a uniquely sized font, etc.) so that an AU can distinguish those words from others and, where appropriate, provide a clarifying remark to a confused HU.

Referring now to FIG. 19, a process 600 that may be performed by an AU's device 12 and a relay to transcribe HU's voice messages and provide line check words immediately to an AU when transcription by a CA lags is illustrated. At block 602 a HU's voice messages are received by an AU's device 12. After block 602 control continues along parallel sub-processes to blocks 604 and 612. At block 604 the AU's device processor uses an automated voice engine to transcribe the HU's voice messages to text. Here, it is assumed that the voice engine may generate several errors and therefore likely would be insufficient for the purposes of providing captioning to the AU. The engine, however, is optimized and trained to caption a set (e.g., 10 to 100) of line check words and/or phrases which the engine can do extremely accurately. At block 606, the AU's device processor searches for line check words in the automated text. At block 608, if a line check word or phrase is not identified, control passes back up to block 602 where the process continues to cycle. At block 608, if a line check word or phrase is identified, control passes to block 610 where the line check word/phrase is immediately presented (see phrase "Are you still there?" in FIG. 18) to the AU via display 18 either in-line or in a special location and, in at least some cases, in a visually distinct manner.

Referring still to FIG. 19, at block 612 the HU's voice messages are sent to a relay for transcription. At block 614, transcribed text is received at the AU's device back from the relay. At block 616 the text from the relay is used to fill in the intermediate blanks (see again FIG. 17 and also FIG. 18 where text has been filled in) on the AU's display.

ASR Suggests Errors in CA Generated Text

In at least some embodiments it is contemplated that an automated voice-to-text engine may operate all the time and may check for and indicate any potential errors in CA generated text so that the CA can determine if the errors should be corrected. For instance, in at least some cases, the automated voice engine may highlight potential errors in CA generated text on the CA's display screen inviting the CA to contemplate correcting the potential errors. In these cases the CA would have the final say regarding whether or not a potential error should be altered.

Consistent with the above comments, see FIG. 20 that shows a screen shotof a CA's display screen where potential errors have been highlighted todistinguish the errors from other text. Exemplary CA generated text isshown at 650 with errors shown in phantom boxes 652, 654 and 656 thatrepresent highlighting. In the illustrated example, exemplary wordsgenerated by an automated voice-to-text engine are also presented to theCA in hovering fields above the potentially erroneous text as shown at658, 660 and 662. Here, a CA can simply touch a suggested correction ina hovering field or use a pointing device such as a mouse controlledcursor to select a presented word to make a correction and replace theerroneous word with the automated text suggested in the hovering field.If a CA instead touches an error, the CA can manually change the word toanother word. If a CA does not touch an error or an associated correctedword, the word remains as originally transcribed by the CA. An “AcceptAll” icon is presented at 669 that can be selected to accept all of thesuggestions presented on a CA's display. All corrected words aretransmitted to an AU's device to be displayed.

Referring to FIG. 21, a method 700 by which a voice engine generates text to be compared to CA generated text and for providing a correction interface as in FIG. 20 for the CA is illustrated. At block 702 the HU's voice messages are provided to a relay. After block 702 control follows two parallel paths to blocks 704 and 716. At block 704 the HU's voice messages are transcribed into text by an automated voice-to-text engine run by the relay server before control passes to block 706. At block 716 a CA transcribes the HU's voice messages to CA generated text. At block 718 the CA generated text is transmitted to the AU's device to be displayed. At block 720 the CA generated text is displayed on the CA's display screen 50 for correction, after which control passes to block 706.

Referring still to FIG. 21, at block 706 the relay server compares the CA generated text to the automated text to identify any discrepancies. Where the automated text matches the CA generated text at block 708, control passes back up to block 702 where the process continues. Where the automated text does not match the CA generated text at block 708, control passes to block 710 where the server visually distinguishes the mismatched text on the CA's display screen 50 and also presents suggested correct text (e.g., the automated text). Next, at block 712 the server monitors for any error corrections by the CA and at block 714, if an error has been corrected, the corrected text is transmitted to the AU's device for in-line correction.
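
One possible way to implement the block 706 comparison is a word-level alignment of the two text streams; the sketch below uses Python's difflib for that alignment, which is an assumption rather than the disclosed method.

```python
import difflib

def find_discrepancies(ca_text, asr_text):
    """Return (ca_word_index, ca_word, suggested_asr_word) tuples for mismatched words."""
    ca_words, asr_words = ca_text.split(), asr_text.split()
    matcher = difflib.SequenceMatcher(a=ca_words, b=asr_words)
    suggestions = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            for offset, idx in enumerate(range(i1, i2)):
                asr_idx = j1 + offset
                if asr_idx < j2:
                    suggestions.append((idx, ca_words[idx], asr_words[asr_idx]))
    return suggestions

# Example: suggests "Pete's" in place of a CA-typed "Pals" (compare FIG. 25).
print(find_discrepancies("lets get Pals pizza tonight", "lets get Pete's pizza tonight"))
```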

In at least some embodiments the relay server may be able to generatesome type of probability or confidence factor related to how likely adiscrepancy between automated and CA generated text is related to a CAerror and may only indicate errors and present suggestions for probableerrors or discrepancies likely to be related to errors. For instance,where an automated text segment is different than an associated CAgenerated text segment but the automated segment makes no sensecontextually in a sentence, the server may not indicate the discrepancyor may not show the automated text segment as an option for correction.The same discrepancy may be shown as a potential error at a differenttime if the automated segment makes contextual sense.

In still other embodiments, automated voice-to-text software that operates at the same time as a CA to generate text may be trained to recognize words often missed by a CA, such as articles, for instance, and to ignore other words that CAs more accurately transcribe.

The particular embodiments disclosed above are illustrative only, as theinvention may be modified and practiced in different but equivalentmanners apparent to those skilled in the art having the benefit of theteachings herein. Furthermore, no limitations are intended to thedetails of construction or design herein shown, other than as describedin the claims below. It is therefore evident that the particularembodiments disclosed above may be altered or modified and all suchvariations are considered within the scope and spirit of the invention.Accordingly, the protection sought herein is as set forth in the claimsbelow.

Thus, the invention is to cover all modifications, equivalents, andalternatives falling within the spirit and scope of the invention asdefined by the following appended claims. For example, while the methodsabove are described as being performed by specific system processors, inat least some cases various method steps may be performed by othersystem processors. For instance, where a HU's voice is recognized andthen a voice model for the recognized HU is employed for voice-to-texttranscription, the voice recognition process may be performed by an AU'sdevice and the identified voice may be indicated to a relay 16 whichthen identifies a related voice model to be used. As another instance, aHU's device may identify a HU's voice and indicate the identity of theHU to the AU's device and/or the relay.

As another example, while the system is described above in the context of a two line captioning system where one line links an AU's device to a HU's device and a second line links the AU's device to a relay, the concepts and features described above may be used in any transcription system including a system where the HU's voice is transmitted directly to a relay and the relay then transmits transcribed text and the HU's voice to the AU's device.

As still one other example, while inputs to an AU's device may include mechanical or virtual on screen buttons/icons, in some embodiments other input arrangements may be supported. For instance, in some cases help or a captioning request may be indicated via a voice input (e.g., a verbal request for assistance or for captioning) or via a gesture of some type (e.g., a specific hand movement in front of a camera or other sensor device that is reserved for commencing captioning).

As another example, in at least some cases where a relay includes firstand second differently trained CAs where first CAs are trained to becapable of transcribing and correcting text and second CAs are onlytrained to be capable of correcting text, a CA may always be on a callbut the automated voice-to-text software may aid in the transcriptionprocess whenever possible to minimize overall costs. For instance, whena call is initially linked to a relay so that a HU's voice is receivedat the relay, the HU's voice may be provided to a first CA fully trainedto transcribe and correct text. Here, voice-to-text software may trainto the HU's voice while the first CA transcribes the text and after thevoice-to-text software accuracy exceeds a threshold, instead ofcompletely cutting out the relay or CA, the automated text may beprovided to a second CA that is only trained to correct errors. Here,after training the automated text should have minimal errors andtherefore even a minimally trained CA should be able to make correctionsto the errors in a timely fashion. In other cases, a first CA assignedto a call may only correct errors in automated voice-to-texttranscription and a fully trained revoicing and correcting CA may onlybe assigned after a help or caption request is received.

In other systems an AU's device processor may run automated voice-to-text software to transcribe HU's voice messages and may also generate a confidence factor for each word in the automated text based on how confident the processor is that the word has been accurately transcribed. The confidence factors over a most recent number of words (e.g., 100) or a most recent period (e.g., 45 seconds) may be averaged and the average used to assess an overall confidence factor for transcription accuracy. Where the confidence factor is below a threshold level, the device processor may link to a relay for more accurate transcription either via more sophisticated automated voice-to-text software or via a CA. The automated process for linking to a relay may be used instead of or in addition to the process described above whereby an AU selects a "caption" button to link to a relay.
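
A minimal sketch of that rolling confidence check follows, assuming per-word confidence values between 0 and 1, a 100 word window and a 0.85 threshold; the link_to_relay callback stands in for whatever linking mechanism a given embodiment uses.

```python
from collections import deque

class ConfidenceMonitor:
    """Rolling average of per-word ASR confidence factors for an ongoing call."""

    def __init__(self, window_words=100, threshold=0.85):
        self.window = deque(maxlen=window_words)
        self.threshold = threshold

    def add_word(self, confidence, link_to_relay):
        self.window.append(confidence)
        average = sum(self.window) / len(self.window)
        # Only act once a full window has accumulated and the average is poor.
        if len(self.window) == self.window.maxlen and average < self.threshold:
            link_to_relay()   # request CA or more capable ASR captioning
        return average
```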

User Customized Complex Words

In addition to storing HU voice models, a system may also store other information that could be used when an AU is communicating with specific HUs to increase accuracy of automated voice-to-text software when used. For instance, a specific HU may routinely use complex words from a specific industry when conversing with an AU. The system software can recognize when a complex word is corrected by a CA or contextually by automated software and can store the word and the pronunciation of the word by the specific HU in a HU word list for subsequent use. Then, when the specific HU subsequently links to the AU's device to communicate with the AU, the stored word list for the HU may be accessed and used to automate transcription. The HU's word list may be stored at a relay, by an AU's device or even by a HU's device where the HU's device has data storing capability.

In other cases a word list specific to an AU's device (i.e., to an AU) that includes complex or common words routinely used to communicate with the AU may be generated, stored and updated by the system. This list may include words used on a regular basis by any HU that communicates with an AU. In at least some cases this list or the HU's word lists may be stored on an internet accessible database (e.g., in the "cloud") so that the AU or some other person has the ability to access the list(s) and edit words on the list via an internet portal or some other network interface.
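
The following sketch shows one simple way such word lists could be maintained, assuming an in-memory store keyed by an HU device identifier; a deployed system would more likely use the relay database or cloud storage noted above.

```python
hu_word_lists = {}   # keyed by an HU device identifier (illustrative storage only)

def record_correction(hu_id, corrected_word, pronunciation_hint=None):
    """Add a CA- or context-corrected complex word to the HU's word list."""
    hu_word_lists.setdefault(hu_id, {})[corrected_word.lower()] = pronunciation_hint

def word_list_for_call(hu_id):
    """Return the stored words so an ASR engine can be biased toward them."""
    return sorted(hu_word_lists.get(hu_id, {}))

record_correction("555-0100", "echocardiogram")
print(word_list_for_call("555-0100"))   # ['echocardiogram']
```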

Where an HU's complex or hard to spell word list and/or an AU's wordlist is available, when a CA is creating CA generated text (e.g., viarevoicing, typing, etc.), an ASR engine may always operate to search theHU voice signal to recognize when a complex or difficult to spell wordis annunciated and the complex or hard to spell words may beautomatically presented to the CA via the CA display screen in line withthe CA generated text to be considered by the CA. Here, while the CAwould still be able to change the automatically generated complex word,it is expected that CA correction of those words would not occur oftengiven the specialized word lists for the specific communicating parties.

Dialect and Other Basis for Specific Transcription Programs

In still other embodiments various aspects of a HU's voice messages may be used to select different voice-to-text software programs that are optimized for voices having different characteristic sets. For instance, there may be different voice-to-text programs optimized for male and female voices or for voices having different dialects. Here, system software may be able to distinguish one dialect from others and select an optimized voice engine/software program to increase transcription accuracy. Similarly, a system may be able to distinguish a high pitched voice from a low pitched voice and select a voice engine accordingly.

In some cases a voice engine may be selected for transcribing a HU's voice based on the region of a country in which a HU's device resides. For instance, where a HU's device is located in the southern part of the United States, an engine optimized for a southern dialect may be used, while a device in New England may cause the system to select an engine optimized for another dialect. Different word lists may also be used based on the region of a country in which a HU's device resides.
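
As a sketch only, engine selection can reduce to a lookup keyed on region with an optional pitch refinement; the engine names, region keys and the 180 Hz pitch split below are illustrative assumptions.

```python
ENGINES_BY_REGION = {
    "US-South": "asr_southern_dialect",
    "US-NewEngland": "asr_new_england_dialect",
}
DEFAULT_ENGINE = "asr_general"

def select_engine(region, voice_pitch_hz=None):
    """Pick a dialect-tuned engine by region, optionally refined by voice pitch."""
    engine = ENGINES_BY_REGION.get(region, DEFAULT_ENGINE)
    if voice_pitch_hz is not None and voice_pitch_hz > 180:
        engine += "_high_pitch"   # e.g., a variant tuned for higher pitched voices
    return engine

print(select_engine("US-South", voice_pitch_hz=210))   # asr_southern_dialect_high_pitch
```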

Indicating/Selecting Caption Source

In at least some cases it is contemplated that an AU's device will provide a text or other indication to an AU to convey how text that appears on an AU device display 18 is being generated. For instance, when automated voice-to-text software (e.g., an automated voice recognition (ASR) system) is generating text, the phrase "Software Generated Text" may be persistently presented (see 729 in FIG. 22) at the top of a display 18 and when CA generated text is presented, the phrase "CA Generated Text" (not illustrated) may be presented. A phrase "CA Corrected Text" (not illustrated) may be presented when automated text is corrected by a CA.

In some cases a set of virtual buttons (e.g., 68 in FIG. 1) ormechanical buttons may be provided via an AU device allowing an AU toselect captioning preferences. For instance, captioning options mayinclude “Automated/Software Generated Text”, “CA Generated Text” (seevirtual selection button 719 in FIG. 22) and “CA Corrected Text” (seevirtual selection button 721 in FIG. 22). This feature allows an AU topreemptively select a preference in specific cases or to select apreference dynamically during an ongoing call. For example, where an AUknows from past experience that calls with a specific HU result inexcessive automated text errors, the AU could select “CA generated text”to cause CA support to persist during the duration of a call with thespecific HU.

Caption Confidence Indication

In at least some embodiments, automated voice-to-text accuracy may be tracked by a system and indicated to any one or a subset of a CA, an AU, and an HU either during CA text generation or during automated text presentation, or both. Here, the accuracy value may be over the duration of an ongoing call or over a short most recent rolling period or number of words (e.g., last 30 seconds, last 100 words, etc.), or for a most recent HU turn at talking. In some cases two averages, one over a full call period and the other over a most recent period, may be indicated. The accuracy values would be provided via the AU device display 18 (see 728 in FIG. 22) and/or the CA workstation display 50. Where an HU device has a display (e.g., a smart phone, a tablet, etc.), the accuracy value(s) may be presented via that display in at least some cases. To this end, see the smart phone type HU device 800 in FIG. 24 where an accuracy rate is displayed at 802 for a call with an AU. It is expected that seeing a low accuracy value would encourage an HU to try to annunciate words more accurately or slowly to improve the value.
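
A sketch of tracking the two accuracy values (full call and most recent words) appears below; it assumes each transcribed word is eventually marked correct or incorrect, for example via CA corrections, and the class and method names are hypothetical.

```python
from collections import deque

class AccuracyTracker:
    """Tracks a full-call accuracy value and a rolling value over recent words."""

    def __init__(self, recent_window=100):
        self.total = 0
        self.correct = 0
        self.recent = deque(maxlen=recent_window)

    def record(self, word_was_correct):
        self.total += 1
        self.correct += int(word_was_correct)
        self.recent.append(int(word_was_correct))

    def values(self):
        full_call = self.correct / self.total if self.total else 1.0
        rolling = sum(self.recent) / len(self.recent) if self.recent else 1.0
        return full_call, rolling   # e.g., values shown at 728 in FIG. 22 or 802 in FIG. 24
```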

Non-Text Communication Enhancements

Human communication has many different components and the meanings ascribed to text words are only one aspect of that communication. One other aspect of human non-text communication includes how words are annunciated, which often betrays a speaker's emotions or other meaning. For instance, a simple change in volume while words are being spoken is often intended to convey a different level of importance. Similarly, the duration over which a word is expressed, the tone or pitch used when a phrase is annunciated, etc., can convey a different meaning. For instance, annunciating the word "Yes" quickly can connote a different meaning than annunciating the word "Yes" very slowly or such that the "s" sound carries on for a period of a few seconds. A simple text word representation is devoid of a lot of meaning in an originally spoken phrase in many cases.

In at least some embodiments of the present disclosure it is contemplated that volume changes, tone, length of annunciation, pitch, etc., of an HU's voice signal may be sensed by automated software and used to change the appearance of or otherwise visually distinguish transcribed text that is presented to an AU via a device display 18 so that the AU can more fully understand and participate in a richer communication session. To this end, see, for instance, the two textual effects 732 and 734 in AU device text 730 in FIG. 22 where an arrow effect 732 represents a long annunciation period while a bolded/italicized effect 734 represents an appreciable change in HU voice signal volume. Many other non-textual characteristics of an HU voice signal are contemplated and may be sensed, and each may have a different appearance. For instance, pitch, speed of speaking, etc., may all be automatically determined and used to provide distinct visual cues along with the transcribed text.

The visual cues may be automatically provided with or used todistinguish text presented via an AU device display regardless of thesource of the text. For example, in some cases automated text may besupplemented with visual cues to indicate other communicationcharacteristics and in at least some cases even CA generated text may besupplemented with automatically generated visual cues indicating how anHU annunciates various words and phrases. Here, as voice characteristicsare detected for an HU's utterances, software tracks the voicecharacteristics in time and associates those characteristics withspecific text words or phrases generated by the CA. Then, the visualcues for each voice characteristic are used to visually distinguish theassociated words when presented to the AU.

In at least some cases an AU may be able to adjust the degree to which text is enhanced via visual cues or even to select preferred visual cues for different automatically identified voice characteristics. For instance, a specific AU may find fully enabled visual cuing to be distracting and instead may only want bold capital letter visual cuing when an HU's volume level exceeds some threshold value. AU device preferences may be set via a display 18 during some type of device commissioning process.

In some embodiments it is contemplated that the automated software that identifies voice characteristics will adjust or train to an HU's voice during the first few seconds of a call and will continue to train to that voice so that voice characteristic identification is normalized to the HU's specific voice signal to avoid excessive visual cuing. Here, it has been recognized that some people's voices will have persistent voice characteristics that would normally be detected as anomalies if compared to a voice standard (e.g., a typical male or female voice). For instance, a first HU may always speak loudly and therefore, if his voice signal was compared to an average HU volume level, the voice signal would exceed the average level most if not all the time. Here, to avoid always distinguishing the first HU's voice signal with visual cuing indicating a loud voice, the software would use the HU voice signal to determine that the first HU's voice signal is persistently loud and would normalize to the loud signal so that words uttered within a range of volumes near the persistent loud volume would not be distinguished as loud. Here, if the first HU's voice signal exceeds the range about his persistent volume level, the exceptionally loud signal may be recognized as a clear deviation from the persistent volume level for the normalized voice and therefore distinguished with a visual cue for the AU when associated text is presented. The voice characteristic recognizing software would automatically train to the persistent voice characteristics for each HU including, for instance, pitch, tone, speed of annunciation, etc., so that persistent voice characteristics of specific HU voice signals are not visually distinguished as anomalies.

In at least some cases, as in the case of voice models developed andstored for specific HUs, it is contemplated that HU voice models mayalso be automatically developed and stored for specific HU's forspecifying voice characteristics. For instance, in the above examplewhere a first HU has a particularly loud persistent voice, the volumerange about the first HU's persistent volume as well as other persistentcharacteristics may be determined once during an initial call with an AUand then stored along with a phone number or other HU identifyinginformation in a system database. Here, the next time the first HUcommunicates with an AU via the system, the HU voice characteristicmodel would be automatically accessed and used to detect voicecharacteristic anomalies and to visually distinguish accordingly.

Referring again to FIG. 22, in addition to changing the appearance of transcribed text to indicate annunciation qualities or characteristics, other visual cues may be presented. For instance, if an HU persistently talks in a volume that is much higher than typical for the HU, a volume indicator 717 may be presented or visually altered in some fashion to indicate the persistent volume. As another example, a volume indicator 715 may be presented above or otherwise spatially proximate any word annunciated with an unusually high volume. In some cases the distinguishing visual cue for a specially annunciated word may only persist for a short duration (e.g., 3 seconds, until the end of a related sentence or phrase, for the next 5 words of an utterance, etc.) and then be eliminated. Here, the idea is that the visual cuing is supposed to mimic the effect of an annunciated word or phrase which does not persist long term (e.g., the loud effect of a high volume word only persists as the word is being annunciated).

The software used to generate the HU voice characteristic models and/orto detect voice anomalies to be visually distinguished may be run viaany of an HU device processor, an AU device processor, a relay processorand a third party operated processor linkable via the internet or someother network. In at least some cases it will be optimal for an HUdevice to develop the HU model for an HU that is associated with thedevice and to store the model and apply the model to the HU's voice todetect anomalies to be visually distinguished for several reasons. Inthis regard, a particularly rich acoustic HU voice signal is availableat the HU device so that anomalies can be better identified in manycases by the HU device as opposed to some processor downstream in thecaptioning process.

Sharing Text with HU

Referring again to FIG. 24, in at least some embodiments where an HUdevice 800 includes a display screen 801, an HU voice text transcription804 may also be presented via the HU device. Here, an HU viewing thetranscribed text could formulate an independent impression oftranscription accuracy and whether or not a more robust transcriptionprocess (e.g., CA generation of text) is required or would be preferred.In at least some cases a virtual “CA request” button 806 or the like maybe provided on the HU screen for selection so that the HU has theability to initiate CA text transcription and or CA correction of text.Here, an HU device may also allow an HU to switch back to automated textif an accuracy value 802 exceeds some threshold level. Where HU voicecharacteristics are detected, those characteristics may be used tovisually distinguish text at 804 in at least some embodiments.

Captioning Via HU's Device

Where an HU device is a smart phone, a tablet computing device or some other similar device capable of downloading software applications from an application store, it is contemplated that a captioning application may be obtained from an application store for communication with one or more AU devices 12. For instance, the son or daughter of an AU may download the captioning application to be used any time the device user communicates with the AU. Here, the captioning application may have any of the functionality described in this disclosure and may result in a much better overall system in various ways.

For instance, a captioning application on an HU device may run automated voice-to-text software on a digital HU voice signal as described above where that text is provided to the AU device 12 for display and, at times, to a relay for correction, voice model training, voice characteristic model training, etc. As another instance, an HU device may train a voice model for an HU any time an HU's voice signal is obtained regardless of whether or not the HU is participating in a call with an AU. For example, if a dictation application on an HU device which is completely separate from a captioning application is used to dictate a letter, the HU voice signal during dictation may be used to train a general HU voice model for the HU and, more specifically, a general model that can be used subsequently by the captioning system or application. Similarly, an HU voice signal captured during entry of a search phrase into a browser or an address into mapping software which is independent of the captioning application may be used to further train the general voice model for the HU. Here, the general voice model may be extremely accurate even before being used by the AU captioning application. In addition, an accuracy value for an HU's voice model may be calculated prior to an initial AU communication so that, if the accuracy value exceeds a high or required accuracy standard, automated text transcription may be used for an HU-AU call without requiring CA assistance, at least initially.

For instance, prior to an initial AU call, an HU device processor training to an HU voice signal may assign confidence factors to text words automatically transcribed by an ASR engine from HU voice signals. As the software trains to the HU voice, the confidence factor values would continue to increase and eventually should exceed some threshold level at which initial captioning during an AU communication would meet accuracy requirements set by the captioning industry.

As another instance, an HU voice model stored by or accessible by the HU device can be used to automatically transcribe text for any AU device without requiring continual redevelopment or teaching of the HU voice model. Thus, one HU device may be used to communicate with two separate hearing impaired persons using two different AU devices without each sub-system redeveloping the HU voice model.

As yet another instance, an HU's smart phone or tablet device running a captioning application may link directly to each of a relay and an AU's device to provide one or more of the HU voice signal, automated text and/or an HU voice model or voice characteristic model to each. This may be accomplished through two separate phone lines or via two channels on a single cellular line or via any other combination of two communication links.

In some cases an HU voice model may be generated by a relay or an AU's device or some other entity (e.g., a third party ASR engine provider) over time and the HU voice model may then be stored on the HU device or rendered accessible via that device for subsequent transcription. In this case, one robust HU voice model may be developed for an HU by any system processor or server independent of the HU device and may then be used with any AU device and relay for captioning purposes.

Assessing/Indicating Communication Characteristics

In still other cases, at least one system processor may monitor andassess line and/or audio conditions associated with a call and maypresent some type of indication to each or a subset of an AU, an HU anda CA to help each or at least one of the parties involved in a call toassess communication quality. For instance, an HU device may be able toindicate to an AU and a CA if the HU device is being used as a speakerphone which could help explain an excessive error rate and help with adecision related to CA captioning involvement. As another instance, anHU's device may independently assess the level of non-HU voice signalnoise being picked up by an HU device microphone and, if the determinednoise level exceeds some threshold value either by itself or in relationto the signal strength of the HU voice signal, may perform somecompensatory or corrective function. For example, one function may be toprovide a signal to the HU indicating that the noise level is high.Another function may be to provide a noise level signal to the CA or theAU which could be indicated on one or both of the displays 50 and 18.Yet another function would be to offer one or more captioning options toany of the HU or AU or even to a text correcting CA when the noise levelexceeds the threshold level. Here, the idea is that as the noise levelincreases, the likelihood of accurate ASR captioning will typicallydecrease and therefore more accurate and robust captioning optionsshould be available.

As another instance, an HU device may transmit a known signal to an AU device which returns the known signal to the HU device and the HU device may compare the received signal to the known signal to determine line or communication link quality. Here, the HU device may present a line quality value as shown at 808 in FIG. 24 for the HU to consider. Similarly, an AU device may generate a line quality value in a similar fashion and may present the line quality signal (not illustrated) to the AU to be considered.

In some cases system devices may monitor a plurality of different system operating characteristics such as line quality, speaker phone use, non-voice noise level, voice volume level, voice signal pace, etc., and may present one or more "coaching" indications to any one of or a subset of the HU, CA and AU for consideration. Here, the coaching indications should help the parties to a call understand if there is something they can do to increase the level of captioning accuracy. Here, in at least some cases only the most impactful coaching indications may be presented and different entities may receive different coaching indications. For instance, where noise at an HU location exceeds a threshold level, a noise indicating signal may only be presented to the HU. Where the system also recognizes that line quality is only average, that indication may be presented to the AU and not to the HU while the HU's noise level remains high. If the HU moves to a quieter location, the noise level indication on the HU device may be replaced with a line quality indication. Thus, the coaching indications should help individual call entities recognize communication conditions that they can affect or that may be the cause of or may lead to poor captioning results for the AU.

In some cases coaching may include generating a haptic feedback or audible signal or both and a text message for an HU and/or an AU. To this end, while AUs routinely look at their devices to see captions during a caption assisted call, many HUs do not look at their devices during a call and simply rely on audio during communication. In the case of an AU, in some cases even when captioning is presented to an AU, the AU may look away from their device display at times when their hearing is sufficient. By providing an additional haptic and/or audible signal, a user's attention can be drawn to their device display where a warning or call state text message may present more information such as, for instance, an instruction to "Speak louder" or "Move to a less noisy space", for consideration.

Text Lag Constraints

In some embodiments an AU may be able to set a maximum text lag timesuch that automated text generated by an ASR engine is used to drive anAU device screen 18 when a CA generated text lag reaches the maximumvalue. For instance, an AU may not want text to lag behind a broadcastHU voice signal by more than 7 seconds and may be willing to accept agreater error rate to stay within the maximum lag time period. Here, CAcaptioning/correction may proceed until the maximum lag time occurs atwhich point automated text may be used to fill in the lag period up to acurrent HU voice signal on the AU device and the CA may be skipped aheadto the current HU signal automatically to continue the captioningprocess. Again, here, any automated fill in text or text not correctedby a CA may be visually distinguished on the AU device display as wellas on the CA display for consideration.

It has been recognized that many AUs using text to understand a broadcast HU voice signal prefer that the text lag behind the voice signal at least some short amount of time. For instance, an AU talking to an HU may stare off into space while listening to the HU voice signal and, only when a word or phrase is not understood, may look to text on display 18 for clarification. Here, if text were to appear on a display 18 immediately upon audio broadcast to an AU, the text may be several words beyond the misunderstood word by the time the AU looks at the display so that the AU would be required to hunt for the word. For this reason, in at least some embodiments, a short minimum text delay may be implemented prior to presenting text on display 18. Thus, all text would be delayed at least 2 seconds in some cases and perhaps longer where a text generation lag time exceeds the minimum lag value. As with other operating parameters, in at least some cases an AU may be able to adjust the minimum voice-to-text lag time to meet a personal preference.

It has been recognized that in cases where transcription switches automatically from a CA to an ASR engine when text lag exceeds some maximum lag time, it will be useful to dynamically change the threshold period as a function of how a communication between an HU and an AU is progressing. For instance, periods of silence in an HU voice signal may be used to automatically adjust the maximum lag period. For example, in some cases if silence is detected in an HU voice signal for more than three seconds, the threshold period to change from CA text to automatic text generation may be shortened to reflect the fact that when the HU starts speaking again, the CA should be closer to a caught up state. Then, as the HU speaks continuously for a period, the threshold period may again be extended. The threshold period prior to automatic transition to the ASR engine to reduce or eliminate text lag may be dynamically changed based on other operating parameters. For instance, rate of error correction by a CA, confidence factor average in ASR text, line quality, noise accompanying the HU voice signal, or any combination of these and other factors may be used to change the threshold period.
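
One hedged sketch of such a dynamically adjusted threshold is shown below; the base threshold, the three second silence trigger and the two second adjustments are illustrative values, not values required by the disclosure.

```python
def adjusted_lag_threshold(base_threshold_s, hu_silence_s, continuous_speech_s):
    """Shorten the switch-over threshold after HU silence, lengthen it during long speech."""
    threshold = base_threshold_s
    if hu_silence_s >= 3.0:
        threshold -= 2.0    # CA should be nearly caught up after a silent stretch
    if continuous_speech_s >= 30.0:
        threshold += 2.0    # sustained speech justifies tolerating a longer lag
    return max(threshold, 2.0)  # never drop below a minimum display delay

def should_switch_to_asr(current_lag_s, base_threshold_s, hu_silence_s, continuous_speech_s):
    return current_lag_s > adjusted_lag_threshold(base_threshold_s, hu_silence_s, continuous_speech_s)

print(should_switch_to_asr(6.5, base_threshold_s=7.0, hu_silence_s=4.0, continuous_speech_s=0.0))  # True
```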

One aspect described above relates to an ASR engine recognizing specific or important phrases like questions (e.g., see the phrase "Are you still there?" in FIG. 18) prior to CA text generation and presenting those phrases immediately to an AU upon detection. Other important phrases may include phrases, words or sound anomalies that typically signify "turn markers" (e.g., words or sounds often associated with a change in speaker from AU to HU or vice versa). For instance, if an HU utters the phrase "What do you think?" followed by silence, the combination including the silent period may be recognized as a turn marker and the phrase may be presented immediately with space markers (e.g., underlined spaces) between CA text and the phrase, to be filled in by the CA text transcription once the CA catches up to the turn marker phrase.

To this end, see the text at 731 in FIG. 22 where CA generated text isshown at 733 with a lag time indicated by underlined spaces at 735 andan ASR recognized turn marker phrase presented at 737. In this type ofsystem, in some cases the ASR engine will be programmed with a small set(e.g., 100-300) of common turn marker phrases that are specificallysought in an HU voice signal and that are immediately presented to theAU when detected. In some cases, non-text voice characteristics like thechange in sound that occurs at the end of a question which is often thesignal for a turn marker may be sought in an HU voice signal and any ASRgenerated text within some prior period (e.g., 5 seconds, the previous 8words, etc.) may be automatically presented to an AU.

Automatic Voice Signal Routing Based on Call Type

It has been recognized that some types of calls can almost always beaccurately handled by an ASR engine. For instance, auto-attendant typecalls can typically be transcribed accurately via an ASR. For thisreason, in at least some embodiments, it is envisioned that a systemprocessor at the AU device or at the relay may be able to determine acall type (e.g., auto-attendant or not, or some other call typeroutinely accurately handled by an ASR engine) and automatically routecalls within the overall system to the best and most efficient/effectiveoption for text generation. Thus, for example, in a case where an AUdevice manages access to an ASR operated by a third party and accessiblevia an internet link, when an AU places a call that is received by anauto-attendant system, the AU device may automatically recognize theanswering system as an auto-attendant type and instead of transmittingthe auto-attendant voice signal to a relay for CA transcription, maytransmit the auto-attendant voice signal to the third party ASR enginefor text generation.

In this example, if the call type changes mid-stream during itsduration, the AU device may also transmit the received voice signal to aCA for captioning if appropriate. For instance, if an interactive voicerecognition auto-attendant system eventually routes the AU's call to alive person (e.g., a service representative for a company), once thelive person answers the call, the AU device processor may recognize theperson's voice as a non-auto-attendant signal and route that signal to aCA for captioning as well as to the ASR for voice model training. Inthese cases, the ASR engine may be specially tuned to transcribeauto-attendant voice signals to text and, when a live HU gets on theline, would immediately start training a voice model for that HU's voicesignal.

Synchronizing Voice and Text for Playback

In cases or at times when HU voice signals are transcribed automaticallyto text via an ASR engine when a CA is only correcting ASR generatedtext, the relay may include a synchronizing function or capability sothat, as a CA listens to an HU's voice signal during an error correctionprocess, the associated text from the ASR is presented generallysynchronously to the CA with the HU voice signal. For instance, in somecases an ASR transcribed word may be visually presented via a CA display50 at substantially the same instant at which the word is broadcast tothe CA to hear. As another instance, the ASR transcribed word may bepresented one, two, or more seconds prior to broadcast of that word tothe CA.

In still other cases, the ASR generated text may be presented forcorrection via a CA display 50 immediately upon generation and, as theCA controls broadcast speed of the HU voice signal for correctionpurposes, the word or phrase instantaneously audibly broadcast may behighlighted or visually distinguished in some fashion. To this end, seeFIG. 23 where automated ASR generated text is shown at 748 where a wordinstantaneously audibly broadcast to a CA (see 752) is simultaneouslyhighlighted at 750. Here, as the words are broadcast via CA headset 54,the text representations of the words are highlighted or otherwisevisually distinguished to help the error correcting CA follow along.Here, highlighting may be linked to the start time of a word beingbroadcast, to the end time of the word being broadcast, or in any otherway to the start or end time of the word. For instance, in some cases aword may be highlighted one second prior to broadcast of the word andmay remain highlighted for one second subsequent to the end time of thebroadcast so that several words are typically highlighted at a timegenerally around a currently audibly broadcast word.

As another example, see FIG. 23A where ASR generated text is shown at748A. Here, a word 752A instantaneously broadcast to a CA via headset 54is highlighted at 750A. In this case, however, ASR text scrolls up aswords are audibly broadcast to the CA so that a line of text includingan instantaneously broadcast word is always generally located at thesame vertical height on the display screen 50 (e.g., just above ahorizontal center line in the exemplary embodiment in FIG. 23A). Here,by scrolling the text up, unless correcting text in a different line,the CA can simply focus on the one line of text presented in stationaryfield 753 and specifically the highlighted word at 750A to focus on theword audibly broadcast. In other cases it is contemplated that thehighlight at 750A may in fact be a stationary word field and that eventhe line of text in field 753 may scroll from right to left so that theinstantaneously broadcast word will be located in a stationary wordfield generally near the center of the screen 50. In this way the CA maybe able to simply concentrate on one screen location to view thebroadcast word.

Referring still to FIG. 23A, a selectable button 751 (hereinafter a“caption source switch button” unless indicated otherwise) allows a CAto manually switch from the ASR text generation to full CA assistancewhere the CA generates text and corrects that text instead of startingwith ASR generated text. In addition, a “seconds behind” field 755 ispresented proximate the highlighted broadcast word 750A so that the CAhas ready access to that field to ascertain how far behind the CA is interms of listening to the HU voice message for correction. In addition,an HU silent field 757 is presented that indicates a duration of timebetween HU voice message segments during which the HU remains silent(e.g., does not speak). Here, in some cases the HU may simply pause toallow the AU to respond and that pause would be considered silence.

Referring still to FIG. 23A, field 755 indicates that the audible broadcast is only 12.2 seconds behind despite the illustrated 20 seconds of HU silence at 757 and many ASR words that follow the instantaneously broadcast word at 750A. Here, a system processor accounts for the 20 seconds of HU silence when calculating the seconds behind value as the system can remove that silent period from CA consideration so that the CA can catch up more quickly. Thus, in the FIG. 23A example, the duration of time between when an HU actually uttered the words "restaurant" at 750A and "not" at 759 may be 32.2 seconds but the system can recognize that the HU was silent during 20 of those seconds so that the seconds behind calculation may be 12.2 seconds as shown.
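
The seconds behind arithmetic can be sketched as follows, with the FIG. 23A numbers used as an example; the function name and time inputs are hypothetical.

```python
def seconds_behind(current_broadcast_time_s, latest_asr_word_time_s, hu_silence_s):
    """Gap between the word now broadcast to the CA and the newest ASR word, less HU silence."""
    raw_gap = latest_asr_word_time_s - current_broadcast_time_s
    return max(raw_gap - hu_silence_s, 0.0)

# FIG. 23A example: a 32.2 second raw gap with 20 seconds of HU silence displays 12.2.
print(round(seconds_behind(0.0, 32.2, 20.0), 1))   # 12.2
```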

In at least some cases when the seconds behind delay exceeds somethreshold value, the system may automatically indicate that condition asa warning or alert to the CA. For instance, assume that the thresholddelay is four seconds. Here, when the second behind value exceeds fourseconds, in at least some cases, the seconds behind field may behighlighted or otherwise visually distinguished as an alert. In FIG.23A, field 755 is shown as left down to right cross hatched to indicatethe color red as an alert because the four second delay threshold isexceeded.

In at least some cases it is contemplated that more sophisticatedalgorithms may be implemented for determining when to alert the CA to acircumstance where the seconds behind period becomes problematic. Forinstance, where a seconds behind duration is 12.2 seconds as in FIG.23A, that magnitude of duration may not warrant an alert if confidencefactors associated with ASR generated text thereafter are all extremelyhigh as accurate ASR text thereafter should enable the CA to catch uprelatively quickly to reduce the seconds behind period rapidly. Forinstance, where ASR text confidence factors are high, the system mayautomatically double the broadcast rate of the HU voice signal so thatthe 12.2 second delay can be worked to a zero value in half that time.

As another instance, because HUs speak at different rates at different times, the rate of HU speaking or density of words spoken during a time segment may be used to qualify the delay between a broadcast word and a most recent ASR word generated. For instance, assume a 15 second delay between when a word is broadcast to a CA and the time associated with the most recent ASR generated text. Here, in some cases an HU may utter 3 words during the 15 second period while in other cases the HU may have uttered 30 words during that same period. Clearly, the time required for a CA to work the 15 second delay downward is a function of the density of words uttered by the HU in the intervening time. Here, whether or not to issue the alert would be a function of word density during the delay period.

As yet one other instance, instead of assessing delay by a duration of time, the delay may be based on a number of words between a most recently generated ASR word and the word that is currently being considered by a CA (e.g., the most current word in an HU voice signal considered by the CA). Here, an alert may be issued to the CA when the CA is a threshold number of words behind the most recent ASR generated word. For example, the threshold may be 12 words.

Many other factors may be used to determine when to issue CA delay alerts. For instance, a CA's metrics related to specific HU voice characteristics, voice signal quality factors, etc., may each be used separately or in combination with other factors to assess when an alert is prudent.

In addition to affecting when to issue a delay alert to a user, theabove factors may be used to alter the seconds behind value in field 755to reflect an anticipated duration of time required by a specific CA tocatch up to the most recently generated ASR text. For instance, in FIG.23A if, based on one or more of the above factors, the systemanticipates that it will take the CA 5 seconds to catch up on the 12.2second delay, the seconds behind value may be 5.0 seconds as opposed to12.2 (e.g., in a case where the system speeds up the rate of HU voicesignal broadcast through high confidence ASR text).

In at least some cases an error correcting CA will be able to skip backand forth within the HU voice signal to control broadcast of the HUvoice signal to the CA. For instance, as described above, a CA may havea foot pedal or other control interface device useable to skip back in abuffered HU voice recording 5, 10, etc., seconds to replay an HU voicesignal recording. Here, when the recording skips back, the highlightedtext in representation 748 would likewise skip back to be synchronizedwith the broadcast words. To this end, see FIG. 25 where, in at leastsome cases, a foot pedal activation or other CA input may cause therecording to skip back to the word “pizza” which is then broadcast as at764 and highlighted in text 748 as shown at 762. In other cases, the CAmay simply single tap or otherwise select any word presented on display50 to skip the voice signal play back and highlighted text to that word.For instance, in FIG. 25 icon 766 represents a single tap which causesthe word “pizza” to be highlighted and substantially simultaneouslybroadcast. Other word selecting gestures (e.g., a mouse control click,etc.) are contemplated.

In some embodiments when a CA selects a text word to correct, the voice signal replay may automatically skip to some word in the voice buffer relative to the selected word and may halt voice signal replay automatically until the correction has been completed. For instance, a double tap on the word "Pals" in FIG. 23 may cause that word to be highlighted for correction and may automatically cause the point in the HU voice replay to move backward to a location a few words prior to the selected word "Pals." To this end, see in FIG. 25 that the word "Pete's" that is still highlighted as being corrected (e.g., the CA has not confirmed a complete correction) has been typed in to replace the word "Pals" and the word "pizza" that precedes the word "Pete's" has been highlighted to indicate where the HU voice signal broadcast will again commence after the correction at 760 has been completed. While backward replay skipping has been described, forward skipping is also contemplated.

In some cases, when a CA selects a word in presented text for correction or at least to be considered for correction, the system may skip to a location a few words prior to the selected word and may re-present the HU voice signal starting at that point and ending a few words after that point to give a CA context in which to hear the word to be corrected. Thereafter, the system may automatically move back to a subsequent point in the HU voice signal at which the CA was when the word to be corrected was selected. For instance, again, in FIG. 25, assume that the HU voice broadcast to a CA is at the word "catch" 761 when the CA selects the word "Pete's" 760 for correction. In this case, the CA's interface may skip back in the HU voice signal to the word "pizza" at 762 and re-broadcast the phrase from the word "pizza" to the word "want" 763 to provide immediate context to the CA. After broadcasting the word "want", the interface would skip back to the word "catch" 761 and continue broadcasting the HU voice signal from that point on.

In at least some embodiments where an ASR engine generates automatic text and a CA is simply correcting that text prior to transmission to an AU, the ASR engine may assign a confidence factor to each word generated that indicates how likely it is that the word is accurate. Here, in at least some cases, the relay server may highlight any text on the correcting CA's display screen that has a confidence factor lower than some threshold level to call that text to the attention of the CA for special consideration. To this end, see again FIG. 23 where various words (e.g., 777, 779, 781) are specially highlighted in the automatically generated ASR text to indicate a low confidence factor.
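A minimal sketch of the low-confidence flagging step is shown below, assuming the ASR engine returns each word with a per-word confidence score between 0 and 1; the data structure and the 0.8 threshold are illustrative assumptions, not values taken from the figures.

```python
from dataclasses import dataclass

@dataclass
class CaptionWord:
    text: str
    confidence: float  # assumed per-word score in [0.0, 1.0] from the ASR engine

def flag_low_confidence(words: list[CaptionWord], threshold: float = 0.8) -> list[tuple[str, bool]]:
    """Return (word, highlight) pairs; highlight=True marks words the CA
    display should visually distinguish for special consideration."""
    return [(w.text, w.confidence < threshold) for w in words]

# Example: "pals" would be highlighted on the correcting CA's screen.
words = [CaptionWord("pizza", 0.97), CaptionWord("pals", 0.41), CaptionWord("want", 0.92)]
print(flag_low_confidence(words))
```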

While AU voice signals are not presented to a CA in most cases for privacy reasons, it is believed that in at least some cases a CA may prefer to have some type of indication when an AU is speaking to help the CA understand how a communication is progressing. To this end, in at least some embodiments an AU device may sense an AU voice signal and at least generate some information about when the AU is speaking. The speaking information, without word content, may then be transmitted in real time to the CA at the relay and used to present an indication that the AU is speaking on the CA screen. For instance, see again FIG. 23 where lines 783 are presented on display 50 to indicate that an AU is speaking. As shown, lines 783 are presented on a right side of the display screen to distinguish the AU's speaking activity from the text and other visual representations associated with the HU's voice signal. As another instance, when the AU speaks, a text notice 797 or some graphical indicator (e.g., a talking head) may be presented on the CA display 50 to indicate current speaking by an AU. While not shown, it is contemplated that some type of non-content AU speaking indication like 783 may also be presented to an AU via the AU's device to help the AU understand how the communication is progressing.

Sequential Short Duration Third Party Caption Requests

It has been recognized that some third party ASR systems available via the internet or the like tend to be extremely accurate for short voice signal durations (e.g., 15-30 seconds) after which accuracy becomes less reliable. To deal with ASR accuracy degradation during an ongoing call, in at least some cases where a third party ASR system is employed to generate automated text, the system processor (e.g., at the relay, in the AU device or in the HU device) may be programmed to generate a series of automatic text transcription requests where each request only transmits a short sub-set of a complete HU voice signal. For instance, a first ASR request may be limited to a first 15 seconds of HU voice signal, a second ASR request may be limited to a next 15 seconds of HU voice signal, a third ASR request may be limited to a third 15 seconds of HU voice signal, and so on. Here, each request would present the associated HU signal to the ASR system immediately and continuously as the HU voice signal is received and transcribed text would be received back from the ASR system during the 15 second period. As the text is received back from the ASR system, the text would be cobbled together to provide a complete and relatively accurate transcript of the HU voice signal.

While the HU voice signal may be divided into consecutive periods in some cases, in other cases it is contemplated that the HU voice signal slices or sub-periods sent to the ASR system may overlap at least somewhat to ensure all words uttered by an HU are transcribed and to avoid a case where words in the HU voice signal are split among periods. For instance, voice signal periods may be 30 seconds long and each may overlap a preceding period by 10 seconds and a following period by 10 seconds to avoid split words. In addition to avoiding a split word problem, overlapping HU voice signal periods presented to an ASR system allows the system to use context represented by surrounding words to better (e.g., contextually) convert HU voiced words to text. Thus, a word at the end of the first 20 seconds of a voice signal period will be near the front end of the overlapping portion of the next voice signal period and therefore, typically, will have contextual words prior to and following the word in the next voice signal period so that a more accurate, contextually considered text representation can be generated.
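The sub-period scheduling described above can be reduced to simple window arithmetic. The sketch below, under the assumption of 30 second sub-periods with 10 second overlaps, computes the start and end offsets of each ASR request for a voice signal of a given length; the function name is illustrative and the third party ASR call itself is omitted.

```python
def asr_subperiods(total_seconds: float, period: float = 30.0, overlap: float = 10.0):
    """Yield (start, end) offsets, in seconds, of overlapping HU voice
    sub-periods to submit as separate third party ASR requests."""
    step = period - overlap          # each new request starts 20 s after the prior one
    start = 0.0
    while start < total_seconds:
        yield (start, min(start + period, total_seconds))
        start += step

# Example: a 70 second HU voice signal produces requests covering
# 0-30, 20-50, 40-70 and 60-70 seconds, so a word split at one request
# boundary falls wholly inside a neighboring request.
print(list(asr_subperiods(70)))
```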

In some cases, a system processor may employ two, three or more independent or differently tuned ASR systems to automatically generate automated text and the processor may then compare the text results and formulate a single best transcript representation in some fashion. For instance, once text is generated by each engine, the processor may poll for most common words or phrases and then select the most common as text to provide to an AU, to a CA, to a voice modeling engine, etc.
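For word-aligned output, the polling step can be as simple as a per-position majority vote. The sketch below assumes the competing engines' outputs have already been aligned word for word (a real system would need an alignment step first); it is illustrative only.

```python
from collections import Counter
from itertools import zip_longest

def vote_transcript(engine_outputs: list[list[str]]) -> list[str]:
    """Combine word-aligned transcripts from several ASR engines by
    selecting the most common word at each position."""
    combined = []
    for position_words in zip_longest(*engine_outputs, fillvalue=""):
        word, _count = Counter(w for w in position_words if w).most_common(1)[0]
        combined.append(word)
    return combined

# Example: three engines disagree on one word; the majority choice wins.
outputs = [["did", "you", "catch", "the", "game"],
           ["did", "you", "cash", "the", "game"],
           ["did", "you", "catch", "the", "game"]]
print(" ".join(vote_transcript(outputs)))
```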

Default ASR, User Selects Call Assistance

In most cases automated text (e.g., ASR generated text) will be generated much faster than CA generated text or at least consistently much faster. It has been recognized that in at least some cases an AU will prefer even uncorrected automated text to CA corrected text where the automated text is generated more rapidly and is therefore more in sync with an audio broadcast HU voice signal. For this reason, in at least some cases, a different and more complex voice-to-text triage process may be implemented. For instance, when an AU-HU call commences and the AU requires text initially, automated ASR generated text may initially be provided to the AU. If a good HU voice model exists for the HU, the automated text may be provided without CA correction, at least initially. If the AU, a system processor, or an HU determines that the automated text includes too many errors or if some other operating characteristic (e.g., line noise) that may affect text transcription accuracy is sensed, a next level of the triage process may link an error correcting CA to the call and the ASR text may be presented in essentially real time to the CA via display 50 simultaneously with presentation to the AU via display 18.

Here, as the CA corrects the automated text, corrections are automatically sent to the AU device and are indicated via display 18. The corrections may be presented in-line (e.g., erroneous text replaced), above errors, after errors, or may be visually distinguished via highlighting or the like, etc. If too many errors continue to persist from the AU's perspective, the AU may select an AU device button (e.g., see 68 again in FIG. 1) to request full CA transcription. Similarly, if an error correcting CA perceives that the ASR engine is generating too many errors, the error correcting CA may perform some action to initiate full CA transcription and correction. Similarly, a relay processor or even an AU device processor may detect that an error correcting CA is having to correct too many errors in the ASR generated text and may automatically initiate full CA transcription and correction.

In any case where a CA takes over for an ASR engine to generate text, the ASR engine may still operate on the HU voice signal to generate text and use that text and CA generated text, including corrections, to refine a voice model for the HU. At some point, once the voice model accuracy as tested against the CA generated text reaches some threshold level (e.g., 95% accuracy), the system may again, automatically or at the command of the transcribing CA or the AU, revert back to the CA corrected ASR text and may cut out the transcribing CA to reduce costs. Here, if the ASR engine eventually reaches a second higher accuracy threshold (e.g., 98% accuracy), the system may again, automatically or at the command of an error correcting CA or an AU, revert back to the uncorrected ASR text to further reduce costs.
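The two-threshold reversion logic lends itself to a small selection routine. The sketch below assumes ASR accuracy is measured against CA generated text as a fraction between 0 and 1; the mode names mirror the triage levels above and the 95%/98% thresholds mirror the example values, but the routine itself is illustrative.

```python
def select_captioning_mode(asr_accuracy: float) -> str:
    """Pick a captioning mode from measured ASR accuracy.

    Below 0.95:   full CA transcription and correction (ASR trains in parallel).
    0.95 to 0.98: ASR captions corrected by an error correcting CA.
    Above 0.98:   uncorrected ASR captions only, to further reduce cost.
    """
    if asr_accuracy < 0.95:
        return "full_ca_transcription"
    if asr_accuracy < 0.98:
        return "ca_corrected_asr"
    return "asr_only"

for accuracy in (0.90, 0.96, 0.99):
    print(accuracy, "->", select_captioning_mode(accuracy))
```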

AU Accuracy-Speed Preference Selection

In at least some cases it is contemplated that an AU device may allow an AU to set a personal preference between text transcription accuracy and text speed. For instance, a first AU may have fairly good hearing and therefore may only rely on a text transcript periodically to identify a word uttered by an HU while a second AU has extremely bad hearing and effectively reads every word presented on an AU device display. Here, the first AU may prefer text speed at the expense of some accuracy while the second AU may require accuracy even when speed of text presentation or correction is reduced. An exemplary AU device tool is shown as an accuracy/speed scale 770 in FIG. 18 where an accuracy/speed selection arrow 772 indicates a currently selected operating characteristic. Here, moving arrow 772 to the left, operating parameters like correction time, ASR operation, etc., are adjusted to increase accuracy at the expense of speed and moving arrow 772 right on scale 770 increases speed of text generation at the expense of accuracy.

In at least some embodiments when arrow 772 is moved to the right so speed is preferred over greater accuracy, the system may respond to the setting adjustment by opting for automated text generation as opposed to CA text generation. In other cases where a CA may still perform at least some error corrections despite a high speed setting, the system may limit the window of automated text that a CA is able to correct to a small time window trailing a current time. Thus, for instance, instead of allowing a CA to correct the last 30 seconds of automated text, the system may limit the CA to correcting only the most recent 7 seconds of text so that error corrections cannot lag too far behind current HU utterances.

Where an AU moves arrow 772 to the left so that speed is sacrificed for greater caption accuracy, the system may delay delivery of even automated text to an AU for some time so that at least some automated error corrections are made prior to delivery of initial text captions to an AU. The delay may even persist until a CA has made at least some or even all caption corrections. Other ways of speeding up text generation or increasing accuracy at the expense of speed are contemplated.
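One way to realize the scale 770 behavior is to map the arrow position onto a pair of operating parameters: the trailing window a CA may still correct and the delay applied before captions are delivered to the AU. The sketch below is a linear mapping under assumed limits (a 7 to 30 second correction window and a 0 to 5 second delivery delay); all numbers and names are illustrative.

```python
def speed_accuracy_parameters(position: float) -> dict:
    """Map an accuracy/speed slider position to operating parameters.

    position: 0.0 = full accuracy preference, 1.0 = full speed preference.
    Returns the CA-correctable trailing window and the caption delivery delay,
    both in seconds (assumed ranges, for illustration only).
    """
    position = min(max(position, 0.0), 1.0)
    correction_window = 30.0 - 23.0 * position   # 30 s at accuracy end, 7 s at speed end
    delivery_delay = 5.0 * (1.0 - position)      # 5 s at accuracy end, 0 s at speed end
    return {"correction_window_s": correction_window,
            "delivery_delay_s": delivery_delay}

print(speed_accuracy_parameters(0.0))   # accuracy preferred
print(speed_accuracy_parameters(1.0))   # speed preferred
```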

Audio-Text Synchronization Adjustment

In at least some embodiments when text is presented to an error correcting CA via a CA display 50, the text may be presented at least slightly (e.g., ¼ to 2 seconds) prior to broadcast of an associated HU voice signal. In this regard, it has been recognized that many CAs prefer to see text prior to hearing a related audio signal and link the two optimally in their minds when text precedes audio. In other cases specific CAs may prefer simultaneous text and audio and still others may prefer audio before text. In at least some cases it is contemplated that a CA workstation may allow a CA to set text-audio sync preferences. To this end, see exemplary text-audio sync scale 765 in FIG. 25 that includes a sync selection arrow 767 that can be moved along the scale to change text-audio order as well as the delay or lag between the two.

In at least some embodiments an on-screen tool akin to scale 765 and arrow 767 may be provided on an AU device display 18 to adjust HU voice signal broadcast and text presentation timing to meet an AU's preferences.

System Options Based on HU's Voice Characteristics

It has been recognized that some AUs can hear voice signals with a specific characteristic set better than other voice signals. For instance, one AU may be able to hear low pitch, traditionally male voices better than high pitch, traditionally female voice signals. In some embodiments an AU may perform a commissioning procedure whereby the AU tests her capability to accurately hear voice signals having different characteristics and results of those capabilities may be stored in a system database. The hearing capability results may then be used to adjust or modify the way text captioning is accomplished. For instance, in the above case where an AU hears low pitch voices well but not high pitch voices, if a low pitch HU voice is detected when a call commences, the system may use the ASR function more rapidly than in the case of a high pitch voice signal. Voice characteristics other than pitch may be used to adjust text transcription and ASR transition protocols in similar ways.

In some cases it is contemplated that an AU device or other system device may be able to condition an incoming HU voice signal so that the signal is optimized for a specific AU's hearing deficiency. For instance, assume that an AU only hears high pitch voices well. In this case, if a high pitch HU voice signal is received at an AU's device, the AU's device may simply broadcast that voice signal to the AU to be heard. However, if a low pitch HU voice signal is received at the AU's device, the AU's device may modify that voice signal to convert it to a high pitch signal prior to broadcast to the AU so that the AU can better hear the broadcast voice. This automatic voice conditioning may be performed regardless of whether or not the system is presenting captioning to an AU.
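A rough sketch of the pitch conditioning idea is shown below using the librosa audio library: estimate the HU's median fundamental frequency and, if it falls below an assumed cutoff, shift the signal up before broadcast. The cutoff, shift amount and overall approach are illustrative assumptions; a production conditioner would operate on streaming audio rather than a whole buffer.

```python
import numpy as np
import librosa

def condition_for_high_pitch(hu_audio: np.ndarray, sr: int,
                             cutoff_hz: float = 165.0,
                             shift_semitones: float = 4.0) -> np.ndarray:
    """If the HU voice appears low pitched, shift it upward before broadcast.

    hu_audio: mono HU voice samples; sr: sample rate in Hz.
    cutoff_hz and shift_semitones are assumed example values.
    """
    f0 = librosa.yin(hu_audio, fmin=60, fmax=400, sr=sr)  # per-frame pitch estimates
    median_f0 = float(np.median(f0))
    if median_f0 < cutoff_hz:
        return librosa.effects.pitch_shift(hu_audio, sr=sr, n_steps=shift_semitones)
    return hu_audio  # already high pitched enough; broadcast unchanged
```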

In at least some cases where an HU device like a smart phone, tablet, computing device, laptop, smart watch, etc., has the ability to store data or to access data via the internet, a WIFI system or otherwise that is stored on a local or remote (e.g., cloud) server, it is contemplated that every HU device, or at least a subset used by specific HUs, may store an HU voice model for an associated HU to be used by a captioning application or by any software application run by the HU device. Here, the HU model may be trained by one or more applications run on the HU device or by some other application like an ASR system associated with one of the captioning systems described herein that is run by an AU device, the relay server, or some third party server or processor. Here, for example, in one instance, an HU's voice model stored on an HU device may be used to drive a voice-to-text search engine input tool to provide text for an internet search independent of the captioning system. The multi-use and perhaps multi-application trained HU voice model may also be used by a captioning ASR system during an AU-HU call. Here, the voice model may be used by an ASR application run on the HU device, run on the AU device, run by the relay server or run by a third party server.

In cases where an HU voice model is accessible to an ASR engine independent of an HU device, when an AU device is used to place a call to an HU device, an HU model associated with the number called may be automatically prepared for generating captions even prior to connection to the HU device. Where a phone or other identifying number associated with an HU device can be identified prior to an AU answering a call from the HU device, again, an HU voice model associated with the HU device may be accessed and readied by the captioning system for use prior to the answering action to expedite ASR text generation. Most people use one or a small number of phrases when answering an incoming phone call. Where an HU voice model is loaded prior to an HU answering a call, the ASR engine can be poised to detect one of the small number of greeting phrases routinely used to answer calls and to compare the HU's voice signal to the model to confirm that the voice model is for the specific HU that answers the call. If the HU's salutation upon answering the call does not match the voice model, the system may automatically link to a CA to start a CA controlled captioning process.
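The sketch below illustrates the pre-load and verify flow: look up a stored voice model by the HU device's phone number before the call connects, then score the answering salutation against that model and fall back to CA captioning on a mismatch. The model store, the scoring call and the 0.7 threshold are all hypothetical placeholders.

```python
# Hypothetical interfaces; the model store and scorer are placeholders only.
VOICE_MODELS = {}  # phone number -> stored HU voice model object

def preload_model(hu_number: str):
    """Fetch and ready an HU voice model before the call is answered."""
    return VOICE_MODELS.get(hu_number)

def verify_salutation(model, salutation_audio, score_fn, threshold: float = 0.7) -> str:
    """Compare the answering greeting against the preloaded model.

    score_fn(model, audio) is assumed to return a similarity in [0, 1].
    Returns the captioning mode to start with.
    """
    if model is None or score_fn(model, salutation_audio) < threshold:
        return "ca_captioning"            # unknown or different speaker; use a CA
    return "asr_with_preloaded_model"     # model confirmed; expedite ASR captions
```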

While at least some systems will include HU voice models, it should be appreciated that other systems may not and instead may rely on robust voice-to-text software algorithms that train to specific voices over relatively short durations so that every new call with an HU causes the system to rapidly train anew to a received HU voice signal. For instance, in many cases a voice model can be at least initially trained to a specific voice within tens of seconds, after which the model continues to train over the duration of a call to become more accurate as the call proceeds. In at least some of these cases there is no need for voice model storage.

Presenting Captions for AU Voice Messages

While a captioning system must provide accurate text corresponding to an HU voice signal for an AU to view when needed, typical relay systems for deaf and hard of hearing persons would not provide a transcription of an AU's voice signal. Here, generally, the thinking has been that an AU knows what she says in a voice signal and an HU hears that signal and therefore text versions of the AU's voice were not necessary. This, coupled with the fact that AU captioning would have substantially increased the transcription burden on CAs (e.g., it would have required CA revoicing or typing and correction of more voice signal, namely the AU voice signal), meant that AU voice signal transcription simply was not supported. Another reason AU voice transcription was not supported was that at least some AUs, for privacy reasons, do not want both sides of conversations with HUs being listened to by CAs.

In at least some embodiments, it is contemplated that the AU side of a conversation with an HU may be transcribed to text automatically via an ASR engine and presented to the AU via a device display 18 while the HU side of the conversation is transcribed to text in the most optimal way given transcription triage rules or algorithms as described above. Here, the AU voice captions and AU voice signal would never be presented to a CA. While AU voice signal text may not be necessary in some cases, in others it is contemplated that many AUs may prefer that text of their voice signals be presented to be referred back to or simply as an indication of how the conversation is progressing. Seeing both sides of a conversation helps a viewer follow the progress more naturally. Here, while the ASR generated AU text may not always be extremely accurate, accuracy in the AU text is less important because, again, the AU knows what she said.

Where an ASR engine automatically generates AU text, the ASR engine may be run by any of the system processors or devices described herein. In particularly advantageous systems the ASR engine will be run by the AU device 12 where the software that transcribes the AU voice to text is trained to the voice of the AU and therefore is extremely accurate because of the personalized training.

Thus, referring again to FIG. 1, for instance, in at least some embodiments, when an AU-HU call commences, the AU voice signal may be transcribed to text by AU device 12 and presented as shown at 822 in FIG. 26 without providing the AU voice signal to relay 16. The HU voice signal, in addition to being audibly broadcast via AU device 12, may be transmitted in some fashion to relay 16 for conversion to text when some type of CA assistance is required. Accurate HU text is presented on display 18 at 820. Thus, the AU gets to see both AU text, albeit with some errors, and highly accurate HU text. Referring again to FIG. 24, in at least some cases, AU and HU text may also be presented to an HU via an HU device (e.g., a smart phone) in a fashion similar to that shown in FIG. 26.

Referring still to FIG. 26, where both HU and AU text are generated and presented to an AU, the HU and AU text may be presented in staggered columns as shown along with an indication of how each text representation was generated (e.g., see titles at the top of each column in FIG. 26).

In at least some cases it is contemplated that an AU may, at times, not even want the HU side of a conversation to be heard by a CA for privacy reasons. Here, in at least some cases, it is contemplated that an AU device may provide a button or other type of selectable activator to indicate that total privacy is required and then to re-establish relay or CA captioning and/or correction again once privacy is no longer required. To this end, see the “Complete Privacy” button or virtual icon 826 shown on the AU device display 18 in FIG. 26. Here, it is contemplated that, while an AU-HU conversation is progressing and a CA generates/corrects text 820 for an HU's voice signal and an ASR generates AU text 822, if the AU wants complete privacy but still wants HU text, the AU would select icon 826. Once icon 826 is selected, the HU voice signal would no longer be broadcast to the CA and instead an ASR engine would transcribe the HU voice signal to automated text to be presented via display 18. Icon 826 in FIG. 26 would be changed to “CA Caption” or something to that effect to allow the AU to again start full CA assistance when privacy is less of a concern.

In cases where an ASR engine generates confidence factors for ASR captioned words or phrases, the captioned device may indicate low confidence factor words or phrases to the AU, indicating that the words or phrases are more likely than others to be inaccurate. Here, in at least some cases it is contemplated that when a word is highlighted or otherwise visually distinguished or labelled to indicate low confidence, the captioned device will also present an option (e.g., a selectable icon proximate the word) that an AU may select to temporarily link a CA to the call to consider only the selected word and surrounding text for context. When this option is selected, a CA may be linked and the word and surrounding text presented via the CA workstation display while the associated HU voice signal is broadcast to the CA for consideration. Here, the CA may correct the word or may leave the initial ASR text unchanged to affirm accuracy. In still other cases where low confidence is indicated for a word or phrase, where the ASR generates other possible options for that word or phrase, the captioned device may present one or more of those other options for consideration by the AU. Here the AU would simply sort out which option makes most sense or may ask the HU to clarify what was said.

In at least some cases it is contemplated that when an ASR generates confidence factors for ASR text, whether or not that ASR text is automatically and immediately transmitted to an AU captioned device may be a function of the confidence factor. For instance, where an ASR text confidence factor is low, that text may not be transmitted to an AU device for display and instead may simply be presented to a CA for error correction or confirmation while high confidence factor text may be automatically and immediately transmitted to an AU captioned device to be presented. Here, once a CA error corrects the text, the corrected text is transmitted to the AU captioned device for in-line or other error correction.

In some cases where an ASR text segment has a low confidence factor, all text segments thereafter will be delayed until the low confidence text is corrected. In other cases where an ASR text segment has a low confidence factor, only that low confidence text would be delayed and any high confidence factor text subsequent thereto would automatically be transmitted to the AU captioned device for immediate display.
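The second policy described above, holding back only low confidence segments while later high confidence segments flow through, can be sketched as a simple gate. The segment structure, the 0.8 threshold and the queue for CA review are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    confidence: float

def route_segments(segments: list[Segment], threshold: float = 0.8):
    """Split ASR segments into those sent immediately to the AU captioned
    device and those held for CA correction or confirmation first."""
    send_now, hold_for_ca = [], []
    for seg in segments:
        (send_now if seg.confidence >= threshold else hold_for_ca).append(seg)
    return send_now, hold_for_ca

segments = [Segment("did you want", 0.95), Segment("pals pizza", 0.42),
            Segment("for dinner tonight", 0.91)]
immediate, held = route_segments(segments)
print([s.text for s in immediate], [s.text for s in held])
```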

In other cases where an ASR generated text segment confidence factor is low, segment transmission to the AU captioned device for display may be delayed for at least some time so that a CA may observe the text and correct any perceived errors. Here, the delay may be for a preset duration of time (e.g., 3-5 seconds) or may be based on other factors such as where the CA is currently making error corrections within the presented text. Thus, for instance, the delay applied to a low confidence text segment may depend on whether the CA is currently making corrections prior to or subsequent to that segment.

Other Triggers for Automated Catch Up Text

In addition to a voice-to-text lag exceeding a maximum lag time, there may be other triggers for using ASR engine generated text to catch an AU up to an HU voice signal. For instance, in at least some cases an AU device may monitor for an utterance from an AU using the device and may automatically fill in ASR engine generated text corresponding to an HU voice signal when any AU utterance is identified. Here, for example, where CA transcription is 30 seconds behind an HU voice signal, if an AU speaks, it may be assumed that the AU has been listening to the HU voice signal and is responding to the broadcast HU voice signal in real time. Because the AU responds to the up to date HU voice signal, there may be no need for an accurate text transcription for prior HU voice phrases and therefore automated text may be used to automatically catch up. In this case, the CA's transcription task would simply be moved up in time to the current real time HU voice signal automatically and the CA would not have to consider the intervening 30 seconds of HU voice for transcription or even correction. When the system skips ahead in the HU voice signal broadcast to the CA, the system may present some clear indication that it is skipping ahead to the CA to avoid confusion. For instance, when the system skips ahead, a system processor may present a simultaneous warning on the CA display screen indicating that the system is skipping intervening HU voice signal to catch the CA up to real time.

As another example, when an AU device or other system device recognizes a turn marker in an HU voice signal, all ASR generated text that is associated with a lag time may be filled in immediately and automatically.

As still one other instance, an AU device or other device may monitor AU utterances for some specific word or phrase intended to trigger an update of text associated with a lag time. For instance, the AU device may monitor for the word “Update” and, when it is identified, may fill in the lag time with automated text. Here, in at least some cases, the AU device may be programmed to cancel the catch-up word “Update” from the AU voice signal sent to the HU device. Thus, here, the AU utterance “Update” would have the effect of causing ASR text to fill in a lag time without being transmitted to the HU device. Other commands may be recognized and automatically removed from the AU voice signal.
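The keyword trigger can be sketched in the text domain as follows: watch the AU's locally transcribed utterances for a command word, fill the lag with pending ASR text when the command is heard, and strip the command from what is forwarded to the HU. The function names and command handling are illustrative; a real implementation would act on the audio stream, not only on text.

```python
CATCH_UP_COMMAND = "update"

def handle_au_utterance(au_text: str, pending_asr_text: list[str],
                        displayed_text: list[str]) -> str:
    """Process one locally transcribed AU utterance.

    If the catch-up command is present, move all pending ASR text for the
    lag period onto the AU display and remove the command word from the
    text forwarded toward the HU. Returns the text to forward to the HU.
    """
    words = au_text.split()
    if any(w.lower().strip(".,!?") == CATCH_UP_COMMAND for w in words):
        displayed_text.extend(pending_asr_text)   # fill in the lag with ASR text
        pending_asr_text.clear()
        words = [w for w in words if w.lower().strip(".,!?") != CATCH_UP_COMMAND]
    return " ".join(words)

pending = ["so we could meet", "at the park at noon"]
shown = ["I was thinking"]
forward = handle_au_utterance("Update sounds good", pending, shown)
print(shown, "| to HU:", forward)
```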

Thus, it should be appreciated that various embodiments of a semi-automated automatic voice recognition or text transcription system to aid hearing impaired persons when communicating with HUs have been described. In each system there are at least three entities and at least three devices and in some cases there may be a fourth entity and an associated fourth device. In each system there is at least one HU and associated device, one AU and associated device and one relay and associated device or sub-system, while in some cases there may also be a third party provider (e.g., a fourth party) of ASR services operating one or more servers that run ASR software. The HU device, at a minimum, enables an HU to annunciate words that are transmitted to an AU device and receives an AU voice signal and broadcasts that signal audibly for the HU to hear.

The AU device, at a minimum, enables an AU to annunciate words that are transmitted to an HU device, receives an HU voice signal and broadcasts that signal (e.g., audibly, or via Bluetooth where an AU uses a hearing aid) for the AU to attempt to hear, receives or generates transcribed text corresponding to an HU voice signal and displays the transcribed text to an AU on a display to view.

The relay, at a minimum, at times, receives the HU voice signal and generates at least corrected text that may be transmitted to another system device.

In some cases where there is no fourth party ASR system, any of the other functions/processes described above may be performed by any of the HU device, AU device and relay server. For instance, the HU device in some cases may store an HU voice model and/or voice characteristics model, an ASR application and a software program for managing which text, ASR or CA generated, is used to drive an AU device. Here, the HU device may link directly with each of the AU device and relay, and may operate as an intermediary therebetween.

As another instance, HU models, ASR software and caption control applications may be stored and used by the AU device processor or, alternatively, by the relay server. In still other instances different system components or devices may perform different aspects of a functioning system. For instance, an HU device may store an HU voice model which may be provided to an AU device automatically at the beginning of a call and the AU device may transmit the HU voice model along with a received HU voice signal to a relay that uses the model to tune an ASR engine to generate automated text as well as provides the HU voice signal to a first CA for revoicing to generate CA text and a second CA for correcting the CA text. Here, the relay may transmit both transcribed texts (e.g., automated and CA generated) to the AU device and the AU device may then select one of the received texts to present via the AU device screen. Here, CA captioning and correction and transmission of CA text to the AU device may be halted in total or in part at any time by the relay or, in some cases, by the AU device, based on various parameters or commands received from any parties (e.g., AU, HU, CA) linked to the communication.

In cases where a fourth party to the system operates an ASR engine in the cloud or otherwise, at a minimum, the ASR engine receives an HU voice signal at least some of the time and generates automated text which may or may not be used at times to drive an AU device display.

In some cases it is contemplated that ASR engine text (e.g., automated text) may be presented to an HU while CA generated text is presented to an AU and a most recent word presented to an AU may be indicated in the text on the HU device so that the HU has a good sense of how far behind an AU is in following the HU's voice signal. To this end, see FIG. 27 that shows an exemplary HU smart phone device 800 including a display 801 where text corresponding to an HU voice signal is presented for the HU to view at 848. The text 848 includes text already presented to an AU prior to and including the word “after” that is shown highlighted at 850 as well as ASR engine generated text subsequent to the highlight 850 that, in at least the illustrated embodiment, may not have been presented to the AU at the illustrated time. Here, an HU viewing display 801 can see where the AU is in receiving text corresponding to the HU voice signal. The HU may use the information presented as a coaching tool to help the HU regulate the speed at which the HU converses. In addition to indicating the most recent textual word presented to the AU, the most recent word audibly broadcast to the AU may be visually highlighted as shown at 847 as well.

In other cases, an HU device 800 may present other information to the HU indicating AU progress consuming the HU voice signal as a coaching feature. For instance, an HU voice signal consumption meter 821 shown in FIG. 27 may indicate a current state of HU voice signal consumption by the AU on a call. The meter 821 includes a scale 823 and a dynamic sliding consumption pointer 825 that slides along the scale 823 to indicate a duration of HU voice signal between the most recent HU voice signal word captioned for the AU and a current time. Although not shown, the scale 823, pointer 825 or both may change color as the delay in HU voice signal consumption changes (e.g., green to indicate relatively caught up, red indicating a substantial delay, yellow indicating an intermediate duration delay, etc.).
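Driving such a meter amounts to computing the lag between the newest captioned word and the current time, then mapping that lag to a pointer position and a color band. The sketch below assumes a 30 second full-scale lag and green/yellow/red bands at illustrative thresholds.

```python
def consumption_meter(last_captioned_ts: float, now_ts: float,
                      full_scale_s: float = 30.0):
    """Return (pointer_position, color) for an HU-side consumption meter.

    pointer_position runs from 0.0 (caught up) to 1.0 (full-scale lag).
    Thresholds and full-scale value are assumed example settings.
    """
    lag = max(0.0, now_ts - last_captioned_ts)
    position = min(lag / full_scale_s, 1.0)
    if lag < 5.0:
        color = "green"    # relatively caught up
    elif lag < 15.0:
        color = "yellow"   # intermediate duration delay
    else:
        color = "red"      # substantial delay; the HU should slow down
    return position, color

print(consumption_meter(last_captioned_ts=100.0, now_ts=112.0))
```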

In still other cases audible indications of delays in AU consumption of the HU voice signal may be presented as indicated at 827 where the phrase “slow down” is automatically broadcast to the HU via a speaker in the HU phone device 800. Here, the broadcast may be faint in at least some embodiments. In still other cases device 800 may present a text message notice “Slow Down” as shown at 829 and/or control a haptic component (e.g., a vibrator) 831 integrated into device 800 to indicate a need to slow down or wait until the AU catches up more to the current HU voice signal.

Smart HU Device—Other System Arrangements

To be clear, where an HU device is a smart phone, laptop or some other type of computing device that can run an application program to establish and participate in a captioning service, many different communication linking arrangements between the AU, HU and a relay are contemplated and those linkages may be dynamic (e.g., the devices or system components may cooperate to switch communication linkages between parties and entities), automatically changed based on instantaneously required services as well as on other call and communication factors. This concept of dynamic system reconfiguration will be described in the context of the exemplary system 2000 shown in FIG. 61. Exemplary captioning system 2000 includes an AU's communication arrangement 2002, a relay captioning arrangement 2004 and an HU communication device 2010. In some embodiments, the system will also include a third party ASR provider represented by ASR server 2006.

The exemplary AU communication arrangement 2002 is shown to include a captioned device 2012 and a wireless portable computing device 2014. In other embodiments the AU's arrangement 2002 may only include a single captioned/telephone device or a network device (e.g., a wireless router) and a wireless computing device like a smart phone, a laptop, etc. System components illustrated in FIG. 61 outside AU communication arrangement 2002 are shown with communication links to the AU's arrangement 2002 generally, as opposed to the separate devices 2012 and 2014, to indicate that each of the other components may link to either one of the AU's devices 2012 or 2014, depending on system setup. For instance, where device 2014 is a wireless, portable cellular smart phone, all communications with devices outside the AU's arrangement 2002 may be via the wireless cellular phone 2014 and the phone 2014 may communicate with captioned device 2012 so that the phone 2014 isolates captioned device 2012 from other system components.

In other cases, captioned device 2012 may isolate phone 2014 from other system devices and may allow the AU to use a microphone and speaker included in device 2014 for voice communications through device 2012 while device 2012 presents text corresponding to the HU voice signal on the device display screen.

In still other cases devices 2012 and 2014 may share communication tasks to link to system devices outside arrangement 2002. For instance, a first AU-HU link for voice communication may be set up between wireless portable device 2014 and the hearing user's device 2010 while a second AU-relay link may be set up between captioned device 2012 and relay 2004 on which device 2012 transmits the HU voice signal to relay 2004 and receives captions corresponding to the voice signal, with wireless communication between AU devices 2012 and 2014.

Hereinafter, unless indicated otherwise, the AU communication arrangement 2002 will be referred to as the AU's captioned device 2002 in the interest of simplifying this explanation, regardless of the number of devices that comprise the AU's communication arrangement. However, it should be appreciated that arrangement 2002 may include two cooperating devices as shown in FIG. 61 or, in some cases, even more than two devices where additional devices include, for instance, supplemental wireless speakers, a headphone set, one or more supplemental emissive display screens, a wireless router, a video camera for telepresence type communication, etc. Thus, hereafter, communication with any device in arrangement 2002 may include communication with any of the devices within the arrangement and where there are two communication links or channels to arrangement 2002, each of those links may be to any one of the arrangement devices with inter-arrangement communication between the two or more devices as needed. For instance, where a description indicates that HU device 2010 links to the AU captioned device 2002, where the AU's arrangement includes both devices 2012 and 2014, HU device 2010 may link to either of devices 2012 or 2014 (or even both devices 2012 and 2014 in some cases), depending on the AU's system setup and operation.

Referring still to FIG. 61, HU communication device 2010 is shown as a laptop computer but may be any type of computing device that is capable of running software programs and linking via one or more communication links to other system components and includes or has access to a speaker and to a microphone to facilitate voice communications between an HU and the AU.

Relay 2004 includes a relay server 2016 and a plurality of CA workstations (only one shown at 2018). Server 2016 links to other system devices and resources outside the relay sub-system and is also linked to CA workstation 2018. Server 2016 broadcasts the HU voice signal to a CA at station 2018 and receives data (e.g., captions, caption corrections, depending on system arrangement) back from the workstation to forward on to AU captioned device 2002.

The remote third party (3P) provider ASR 2006 may be included in some systems and not in others and, where included in a system, comprises an ASR server 2006. ASR server 2006 receives HU voice signals from some other system device or resource, transcribes that voice signal to ASR captions and then transmits those ASR captions to one or more other system devices and resources to be consumed (e.g., presented on a display, edited to correct errors, etc.).

Referring yet again to FIG. 61, several different communication links are shown and labelled “1” through “6”, each labelled link indicating a different communication link or channel that may be established between different system devices or resources. For instance, link “1” is shown between HU device 2010 and the AU's captioned device 2002. As other instances, link “2” is shown between AU captioned device 2002 and relay 2004, link “3” is shown between relay 2004 and the remote ASR server 2006, link “4” is shown between HU communication device 2010 and relay 2004, link “5” is shown between HU communication device 2010 and third party ASR server 2006 and link “6” is shown between third party ASR server 2006 and the AU's captioned device 2002.

FIG. 61 also includes a link “7” that may be established within the AU's communication arrangement (e.g., between two or more AU communication devices). Although not shown in FIG. 61 it should be appreciated that when AU device 2002 includes two separate devices (e.g., a captioned device and a cell phone, or a wireless router and a cell phone or tablet computing device, or other dual device combinations), any of the communication links 1, 2 or 6 as shown may be with either of the two devices (e.g., link 1 to AU phone device 2014 for HU-AU voice and link 2 to captioned device 2012 for captions) with the two AU devices connected via link 7 as appropriate, or two of those links may be with one of the devices (e.g., link 1 and link 2 to phone device 2014) with devices 2012 and 2014 connected via link 7.

Each of the FIG. 61 links may be a one way or two way communication link and may transmit voice data, caption data, system control data, video data, subsets of those data types or each of those data types, depending on how the system is instantaneously set up. Each link may be any type of communication line or connection including a POTS line, a VOIP or other network channel, a cellular or other wireless channel, etc. Some of the lines may be one type of communication line while others are a different type of communication line.

The FIG. 61 links 1 through 6 are exemplary and, in most cases, only a subset of the illustrated links will be supported by a specific captioning system 2000. For instance, in some cases only links 1, 2 and 3 may be supported while in other cases, each of links 1 through 4 is supported and in still other cases, only links 4 and 2 may be supported during captioning sessions. Other supported link subsets are contemplated.

At a minimum, when HU-AU voice communication occurs, regardless of whether or not captioning service is provided, there has to be some communication path (e.g., one link or two series links) between HU device 2010 and AU device 2002 for voice communications.

When HU-AU voice communication is enhanced with captions provided by a system component other than HU device 2010 or AU device 2002, there has to be some communication link from the HU device 2010 that originates the HU voice signal to the captioning component to deliver the HU voice signal to the captioning component as well as some communication link for delivering captions from the captioning component to the AU device 2002 so that the captions can be presented to the AU. In some cases the HU-AU voice link, the HU voice to relay link and the relay caption to AU device link may each be a single link between two components while in other cases any of these links may be a dual link including first and second series links between components to deliver voice or captions to a destination component. For instance, referring again to FIG. 61, the HU voice signal may be delivered to AU device 2002 via single link 1 in some cases or, in other cases, may be delivered from HU device 2010 through relay 2004 to AU device 2002 via series links 4 and 2. As another instance, HU voice may be delivered to relay 2004 via single link 4 or through AU device 2002 to relay 2004 via series links 1 and 2. In systems including third party ASR 2006, HU voice may be delivered to server 2006 directly via link 5, through relay 2004 via links 4 and 3, through AU device 2002 via links 1 and 6, or even through three series links 1, 2 and 3.
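Because each delivery path is just an ordered series of the numbered links, path selection can be modeled with a small table of which devices each link connects and a check that a candidate series of links actually chains from the originating device to the consuming device. The link/endpoint table below mirrors FIG. 61 (link 7, the intra-arrangement link, is omitted); the function itself is an illustrative sketch.

```python
# Endpoints of the FIG. 61 numbered links.
LINKS = {1: ("HU", "AU"), 2: ("AU", "RELAY"), 3: ("RELAY", "ASR"),
         4: ("HU", "RELAY"), 5: ("HU", "ASR"), 6: ("ASR", "AU")}

def path_connects(path: list[int], source: str, destination: str) -> bool:
    """True if the series of links chains from source to destination."""
    here = source
    for link in path:
        a, b = LINKS[link]
        if here == a:
            here = b
        elif here == b:
            here = a
        else:
            return False  # the next link does not touch the current device
    return here == destination

# HU voice to the third party ASR: direct (5), via the relay (4, 3),
# via the AU device (1, 6), or via the AU device and relay (1, 2, 3).
for path in ([5], [4, 3], [1, 6], [1, 2, 3]):
    print(path, path_connects(path, "HU", "ASR"))
```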

Communication links required to support captioning may be established at different times in different systems. For instance, where relay 2004 generates captions and CA error corrections during captioning sessions, in some cases the link(s) to provide HU voice signals to the relay may only be established after captioning service is required. In other cases where relay 2004 generates captions and CA error corrections during captioning, links required to provide captioning may be established immediately when any HU-AU voice call is initiated (e.g., upon an HU dialing an AU or an AU dialing an HU) or established (e.g., upon an HU answering an AU call or an AU answering an HU call), regardless of whether or not captioning is to commence immediately. In this case, while communication links to support captioning may be established prior to a request for captioning service, a CA may not be assigned to the call until an AU requests captioning service.

Referring again to FIG. 61, once one of the HU and AU devices is used to initiate a call to the other device or to start a captioning service, either one of those devices may be programmed to control communication linkage paths within the illustrated system. For instance, in FIG. 61, assume an HU-AU voice call is progressing on link 1 when an AU requests captioning service. Here, when the AU device 2002 receives the captioning command, in some embodiments, AU device 2002 will establish link 2 to relay 2004 for HU voice and captions. In other embodiments, AU device 2002 may transmit control signals to HU device 2010 causing that device to establish link 4 directly to relay 2004 for HU and AU voice communication as well as establishing link 2 to relay 2004 for HU and AU voice and captions. Here, link 1 may be disabled once links 4 and 2 are established. In this regard, HU device 2010 may store a relay phone number or other network address usable to establish link 4.

In still other embodiments, AU device 2002 may establish link 2 to relay 2004 and transmit control signals to relay 2004 causing the relay server to establish link 4 directly to HU device 2010. In this regard, AU captioned device 2002 may identify an HU phone number or other address information at the beginning of a voice call that is transmitted on to relay 2004 which is usable by the relay to establish link 4.

In a similar fashion, other FIG. 61 links may be initiated or controlled by different ones of the system components illustrated.

Referring again to FIG. 61, in some systems, HU-AU voice communications may always be via link 1 where HU voice signals are transmitted on link 1 to AU captioned device 2002 which broadcasts the HU voice signal to the AU, and AU voice signals captured by an AU device microphone or the like are transmitted back to HU device 2010 on the same link 1 to be broadcast to the HU via device 2010.

Referring still to FIG. 61, during an ongoing HU-AU phone call, if the AU requests captioning service (e.g., selects a captioning button on the captioned device), in some embodiments already described in this specification, link 2 is established to provide the HU voice signal from AU captioned device 2002 to relay 2004. ASR software at one or both of relay server 2016 and ASR server 2006 and/or a CA at relay 2004 transcribes the HU voice signal to text which is transmitted back to AU captioned device 2002 for display to the AU.

Referring again to FIG. 61, in a different exemplary captioning arrangement, when an HU and AU are linked directly via line 1 for two way voice communication and relay captioning services are requested during an ongoing call, AU captioned device 2002 may transmit a control signal to HU device 2010 to cause HU device 2010 to establish a new communication link 4 to connect directly to relay 2004 and relay 2004 may then establish line 2 to link to AU device 2002 so that relay 2004 is located between AU and HU devices 2002 and 2010, respectively, within the communication network and all communications (e.g., voice and captions) then pass through relay 2004 between HU and AU devices 2010 and 2002, respectively. Thus, here, pre-captioning, the HU-AU voice call proceeds through first line 1 alone and after captioning commences, the linkage path in FIG. 61 is along lines 4 and 2 with the HU voice signal to the relay on line 4, the HU voice signal and HU voice captions from relay 2004 to AU captioned device 2002 for display and broadcast to the AU on line 2, AU voice signals from AU device 2002 to the relay 2004 on line 2 and passing through the relay 2004 to HU device 2010 on line 4.

In this example, the system components cooperate to change the communication arrangement (e.g., the link paths are dynamic) so that the HU-AU voice call on line 1 continues between the AU and HU along a different communication or link path/route including lines 4 and 2. One advantage to this captioning arrangement is that an HU voice signal with reduced noise can often be provided to relay 2004 if that voice signal only travels along line 4 as opposed to along lines 1 and 2 to get to the relay, and that often results in more accurate and faster captioning and error correction.

In still another embodiment, a call may start with an HU and an AU communicating via voice only on line 1 and then, once captioning is requested during an ongoing call, the HU device 2010 may link directly to relay 2004 (see line 4 in FIG. 61) and relay 2004 may link directly to AU device 2002 (see line 2) while the AU device to HU device link (see line 1) persists so that all communications, voice or data, are directly communicated from the device that generates the voice signal or data to the component that consumes the voice signal or data without having to pass through any other system device or resource (e.g., HU and AU voice signals are directly between HU and AU devices, the HU voice signal is direct from HU device 2010 to relay 2004 and transcribed text associated with the HU voice is directly passed from relay 2004 to AU device 2002 to be displayed to the AU).

In still other cases, referring again to FIG. 61, when an HU first calls AU captioned device 2002, prior to any request from the AU to commence captioning, one of the HU device 2010 or the AU captioned device 2002 may be programmed to establish links 4 and 2 for HU and AU voice communications through the relay without captioning. At this point the HU and AU voices simply pass through the relay without captioning. Then, when captioning is requested, AU captioned device 2002 simply transmits a command to the relay to initiate captioning service. For instance, in some cases software run by HU's device 2010 may be programmed to recognize when the HU calls an AU that at least periodically needs captioning and, when that AU's number is called, may automatically link to relay 2004 (e.g., call the relay or otherwise establish line 4) and provide the AU captioned device number (or other identifier) to the relay so that the relay can use that number to establish link 2.

As another instance, in other embodiments software run by AU device 2002 may be programmed to, when an incoming call is received, automatically link to relay 2004 (e.g., call the relay or otherwise establish link 2) and provide an HU device number (or other identifier) to the relay so that the relay can use that number to establish link 4.

In still other cases an HU device may be programmed to call a relay when a number associated with an AU captioned device is called to establish a link with the relay. Here, the HU device would store the AU captioned device number as well as a corresponding relay number. Upon entry of the AU device number to commence a call to the AU, the HU device identifies and dials the relay number and presents the AU device number to a processor at the relay when the relay goes off hook (e.g., answers the incoming call). The relay then dials the AU device number and creates communication link 2 between the relay and the AU device for transmitting HU voice and text to the AU device for display.

Where a relay is positioned between AU and HU communication devices during captioning, if an AU wants to disable captioning, the AU may select a disable icon or other input tool via one of the AU's devices causing the relay to cease captioning service. Here, in some embodiments, the HU-relay-AU links may persist for voice communications only, at least until the AU again requests captioning service. In other cases where captioning is disabled, one of the HU and AU devices may be programmed to establish a different direct link with the other of the AU and HU devices for voice communication and the relay may be removed from the communication.

Referring still to FIG. 61, in systems that include a remote third party server 2006 that provides ASR service, other system components may link to server 2006 in any of several different ways including directly or circuitously through other system components and link paths. For instance, as described in other parts of this specification, the relay server 2016 may establish link 3 directly to server 2006 when remote ASR captioning is required and the HU voice signal as well as ASR captions may be transmitted on link 3 between the linked servers. As another instance, AU captioned device 2002 may link (see link 6) directly to ASR server 2006 when automated captioning is required with HU voice and captions transmitted on link 6. In still other cases, HU device 2010 may establish link 5 directly to server 2006 for HU voice and ASR caption transmission.

In still other cases HU voice may be provided to ASR server 2006 via one link and captions may be transmitted from server 2006 to one or more other system components via one or more other links. For example, in FIG. 61, the HU voice signal may be provided to ASR server 2006 via link 5 and ASR captions may be transmitted to relay 2004 and AU captioned device 2002 via links 3 and 6, respectively.

Thus, pre-captioning HU-AU voice communications may be restricted to direct HU-AU link 1 or may be indirect and pass through other intervening system components along two or more series links (e.g., links 4 and 2 in FIG. 61). HU-AU voice communications during captioning may be direct on link 1 or pass through intervening system components, and the HU-AU voice communication paths pre-captioning and during captioning may be different (e.g., direct on link 1 prior to captioning and indirect via links 4 and 2 once captioning commences).

Similarly, HU voice delivery to relay 2004 may be direct via link 4 or circuitous via links 1 and 2 or via other dual or more link paths, and relay generated data (e.g., captions, error corrections, etc.) delivery to AU captioned device 2002 may be direct via link 2 or circuitous via dual or more series link paths.

Many different pre-caption and captioning linkages and dynamic link changes between system components are contemplated by the present disclosure. Table 1 below lists several different pre-caption link path options and captioning link path options for HU-AU voice transmission and text caption transmissions where different systems that are consistent with various aspects of the present disclosure employ different link path subsets. The first column groups captioning systems into two general categories including (I) systems that do not employ a remote third party ASR (see again 2006 in FIG. 61) and (II) systems that do employ a third party ASR.

The second column in Table 1 (entitled “Pre-captioning”) indicates different pre-captioning HU-AU voice link path options (e.g., a separate option on each line in each cell) for each of the two system categories (e.g., without and with 3P ASR 2006 in FIG. 61) in the first column. For instance, the second column indicates different link paths within the FIG. 61 system for transmitting HU and AU voice signals between HU device 2010 and AU device 2002. For example, for a system that does not include the third party ASR 2006, HU-AU voice link options include (i) link 1 or (ii) series links 4 and 2 and, for a system with a third party ASR 2006, listed link options include (i) link 1; (ii) series links 4 and 2; (iii) series links 4, 3, 6 or (iv) series links 5 and 6.

The third column in Table 1 (entitled “During Captioning”) indicates link path options (e.g., again, a separate option presented on each line in each cell) for different HU-AU voice transmission and caption transmissions during caption assisted voice communications (e.g., after captioning commences) for each of the two system categories in the first column. The third column includes six sub-columns, one for each voice or data type transfer. For instance, a first sub-column is labelled “HU-AU voice link” and cells thereunder indicate different link paths within the FIG. 61 system for transmitting HU and AU voice signals between HU device 2010 and AU device 2002. For example, for a system without the third party ASR 2006, HU-AU voice link options include (i) link 1 or (ii) series links 4 and 2 and, for a system with a third party ASR 2006, listed link options include (i) link 1; (ii) series links 4 and 2; (iii) series links 4, 3, 6 and (iv) series links 5 and 6.

Similarly, the second through sixth sub-columns labelled “HU voice to relay”, “Relay captions to AU device”, “HU voice to 3P ASR”, “3P ASR captions to relay” and “3P ASR captions to AU device” each list link options for the data transfer listed in the heading. For instance, for a system with a third party ASR system 2006, the third party ASR captions may be transmitted to the AU device via any of link paths including (i) 6; (ii) 3, 2; (iii) 3, 4, 1 or (iv) 5, 1. In the interest of simplifying this explanation, the voice and data transfer type columns in Table 1 have been labelled (1) through (7).

TABLE 1

(I) Systems W/O 3P ASR
  Pre-Captioning:
    (1) HU-AU voice link: 1; or 4, 2
  During Captioning:
    (2) HU-AU voice link: 1; or 4, 2
    (3) HU voice to relay: 4; or 1, 2
    (4) Relay captions to AU device: 2; or 4, 1

(II) Systems With 3P ASR
  Pre-Captioning:
    (1) HU-AU voice link: 1; 4, 2; 4, 3, 6; or 5, 6
  During Captioning:
    (2) HU-AU voice link: 1; 4, 2; 4, 3, 6; or 5, 6
    (3) HU voice to relay: 4; 1, 2; or 5, 3
    (4) Relay captions to AU device: 2; or 4, 1
    (5) HU voice to 3P ASR: 5; 1, 2, 3; 4, 3; or 1, 6
    (6) 3P ASR captions to relay: 3; 5, 4; or 6, 2
    (7) 3P ASR captions to AU device: 6; 3, 2; 3, 4, 1; or 5, 1

Referring still to FIG. 61 and also again to Table 1, the present disclosure contemplates any system that includes any combination of link paths from Table 1 including one path option from each cell in one of the category rows (I) or (II) (e.g., the first category without 3P ASR 2006 and the second category with 3P ASR 2006). Thus, for instance, for captioning systems that do not include a 3P ASR 2006, a first system may include link paths 1; 1; 4 and 2 from columns (1), (2), (3) and (4), respectively, a second captioning system may include link paths 1; 1; 4 and 4, 1 from columns (1), (2), (3) and (4), respectively, and a third captioning system may include link paths 4, 2; 4, 2; 4 and 4, 1 from columns (1), (2), (3) and (4), respectively. For captioning systems that do include a 3P ASR 2006, a first system may include link paths 1; 1; 4; 2; 5; 3 and 6 from columns (1) through (7), respectively, a second captioning system may include link paths 4, 2; 4, 2; 1, 2; 4, 1; 1, 2, 3; 5, 4 and 3, 2 from columns (1) through (7), respectively, and a third captioning system may include link paths 1; 1; 1, 2; 2; 5; 3 and 6 from columns (1) through (7), respectively. Many other combinations of link paths from Table 1 (and any other possible link path) are contemplated, some that dynamically change when captioning starts and others that persist when captioning is turned on for an ongoing call.

Referring now to FIG. 28, a schematic is shown of an exemplary semi-automated captioning system that is consistent with at least some aspects of the present disclosure. The system enables an HU using device 14 to communicate with an AU using AU device 12 where the AU receives text and HU voice signals via the AU device 12. Each of the HU and the AU link into a gateway server or other computing device 900 that is linked via a network of some type to a relay. HU voice signals are fed through a noise reducing audio optimizer to a 3 pole or path ASR switch device 904 that is controlled by an adaptive ASR switch controller 932 to select one of first, second and third text generating processes associated with switch output leads 940, 942 and 944, respectively. The first text generating process is an automated ASR text process wherein an ASR engine generates text without any input (e.g., data entry, correction, etc.) from any CA. The second text generating process is a process wherein a CA 908 revoices an HU voice or types to generate text corresponding to an HU voice signal and then corrects that text. The third text generating process is one wherein the ASR engine generates automated text and a correcting CA 912 makes corrections to the automated text. In the second process, the ASR engine operates in parallel with the CA to generate automated text in parallel to the CA generated and corrected text.

Referring still to FIG. 28, with switch 904 connected to output lead 940, the HU voice signal is only presented to ASR engine 906 which generates automated text corresponding to the HU voice which is then provided to a voice to text synchronizer 910. Here, synchronizer 910 simply passes the raw ASR text on through a correctable text window 916 to the AU device 12.

Referring again to FIG. 28, with switch 904 connected to output lead 942, the HU voice signal, in addition to being linked to the ASR engine, is presented to CA 908 for generating and correcting text via traditional CA voice recognition 920 and manual correction tools 924 via correction window 922. Here, corrected text is provided to the AU device 12 and is also provided to a text comparison unit or module 930. Raw text from the ASR engine 906 is presented to comparison unit 930. Comparison unit 930 compares the two text streams received and calculates an ASR error rate which is output to switch control 932. Here, where the ASR error rate is low (e.g., below some threshold), control 932 may be controlled to cut the text generating CA 908 out of the captioning process.
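
The error rate calculation performed by comparison unit 930 can be implemented in a number of ways. The following Python sketch shows one possible approach, offered only as an illustration and not as a required implementation: the CA corrected text is treated as truth, a word level alignment is used to estimate the ASR error rate, and an assumed 10% threshold stands in for the switching threshold used by control 932.

    import difflib

    def asr_error_rate(asr_words, ca_words):
        # Estimate the fraction of words that disagree with the CA
        # corrected text, which is treated here as truth.
        matcher = difflib.SequenceMatcher(a=ca_words, b=asr_words)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        errors = max(len(ca_words), len(asr_words)) - matched
        return errors / max(len(ca_words), 1)

    # Hypothetical decision used by switch control 932: cut the text
    # generating CA 908 out once the ASR error rate is low enough.
    ERROR_RATE_THRESHOLD = 0.10  # assumed value for illustration only

    def should_cut_out_ca(asr_text, ca_text):
        return asr_error_rate(asr_text.split(), ca_text.split()) < ERROR_RATE_THRESHOLD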

Referring still to FIG. 28, with switch 904 connected to output lead 944, the HU voice signal, in addition to being linked to the ASR engine, is fed through synchronizer 910 which delays the HU voice signal so that the HU voice signal lags the raw ASR text by a short period (e.g., 2 seconds). The delayed HU voice signal is provided to a CA 912 charged with correcting ASR text generated by engine 906. The CA 912 uses a keyboard or the like 914 to correct any perceived errors in the raw ASR text presented in window 916. The corrected text is provided to the AU device 12 and is also provided to the text comparison unit 930 for comparison to the raw ASR text. Again, comparison unit 930 generates an ASR error rate which is used by control 932 to operate switch device 904. The manual corrections by CA 912 are provided to a CA error tracking unit 918 which counts the number of errors corrected by the CA and compares that number to the total number of words generated by the ASR engine 906 to calculate a CA correction rate for the ASR generated raw text. The correction rate is provided to control 932 which uses that rate to control switch device 904.

Thus, in operation, when an HU-AU call first requires captioning, in at least some cases switch device 904 will be linked to output lead 942 so that full CA transcription and correction occurs in parallel with the ASR engine generating raw ASR text for the HU voice signal. Here, as described above, the ASR engine may be programmed to compare the raw ASR text and the CA generated text and to train to the HU's voice signal so that, over a relatively short period, the error rate generated by comparison unit 930 drops. Eventually, once the error rate drops below some rate threshold, control 932 controls switch device 904 to link to output lead 944 so that CA 908 is taken out of the captioning path and CA 912 is added. CA 912 receives the raw ASR text and corrects that text which is sent on to the AU device 12. As the CA corrects text, the ASR engine continues to train to the HU voice using the corrected errors. Eventually, the ASR accuracy should improve to the point where the correction rate calculated by tracking unit 918 is below some threshold. Once the correction rate is below the threshold, control 932 may control switch 904 to link to output lead 940 to take the CA 912 out of the captioning loop which causes the relatively accurate raw ASR text to be fed through to the AU device 12. As described above, in at least some cases the AU and perhaps a CA or the HU may be able to manually switch between captioning processes to meet preferences or to address perceived captioning problems.
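
The triage sequence just described (full CA captioning, then CA correction of ASR text, then raw ASR text) can be thought of as a simple state machine driven by the error rate from comparison unit 930 and the correction rate from tracking unit 918. The following sketch is illustrative only; the lead names mirror the figure, while the threshold values and function names are assumptions rather than required values.

    FULL_CA = "lead_942"          # CA 908 captions and corrects; ASR trains in parallel
    CA_CORRECTS_ASR = "lead_944"  # ASR captions; CA 912 corrects
    ASR_ONLY = "lead_940"         # raw ASR captions pass straight through

    def next_lead(current, asr_error_rate, ca_correction_rate,
                  error_threshold=0.10, correction_threshold=0.05):
        # Move one step down the triage chain when the relevant metric
        # falls below its threshold; otherwise hold the current lead.
        if current == FULL_CA and asr_error_rate < error_threshold:
            return CA_CORRECTS_ASR
        if current == CA_CORRECTS_ASR and ca_correction_rate < correction_threshold:
            return ASR_ONLY
        return current

A manual selection by the AU, a CA or the HU, as described above, could simply override the lead returned by such a function.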

FIG. 28A illustrates a method or process 1800 in flow chart form wherein a CA manually switches between an ASR caption and CA correction sub-process (e.g., a “first mode” of operation) and a full CA captioning-correcting sub-process (e.g., a “second mode” of operation) that is consistent with at least some aspects of the present disclosure. In FIG. 28A the first mode of operation includes blocks 1802 through 1820 and the second mode of operation includes blocks 1822 through 1836. At block 1802, during a call between an AU and an HU, as the HU speaks, the HU voice signal is transmitted to a captioning relay which provides that HU voice signal to an ASR engine to generate ASR captions corresponding to the HU voice signal. Here, the HU voice signal may be transmitted to the relay either from the HU communication device or from the AU caption or other AU communication device (e.g., an AU phone device linked or in communication with an AU captioned device, a computer, a smart portable communication device, etc.). At block 1804, the ASR captions are transmitted to the AU captioned device for immediate display. At block 1806 the ASR captions are presented to a CA at the relay via a CA workstation display screen and the HU voice signal is broadcast to the CA via a headset or the like. The CA corrects ASR caption errors at block 1808. At block 1810, CA error corrections are transmitted to the AU captioned device which uses those error corrections to make in line or other corrections to the prior presented ASR captions.

Referring still to FIG. 28A, at block 1812, the CA error corrections are tracked by a system processor which generates error rate metrics that reflect accuracy of the ASR captions/engine. At decision block 1814, the processor compares the instantaneous ASR caption error rate to a threshold level that represents an unacceptably high error rate. In this regard, “unacceptably high” means that the error rate is such that the error correction burden on the CA is high and likely slowing down the overall captioning and correction process when compared to metrics for the CA or CAs generally when the system operated via the second mode (e.g., where the CA generates initial captions as well as corrects errors in those captions without aid from an ASR). Thus, for instance, an exemplary threshold metric may be 10% of ASR captions requiring correction. In some cases the notification may be based on the error rate so that a more urgent notification is presented when the error rate is relatively higher.

At block 1814, where the error rate is not greater than the threshold level, control passes back up to block 1802 where the process described above continues. When the error rate exceeds the threshold level, control passes to block 1816 where the system processor generates an alert, notification or other type of signal to suggest to the CA that the CA manually switch from the first mode (e.g., ASR captions-CA corrections) to the second mode (e.g., full CA captions and corrections). For instance, a text notification may be presented along a lower edge of a display screen at a CA's workstation indicating, “Too many ASR errors, advise you switch to full CA captioning and corrections.” Other notification types are contemplated including audible, haptic and combinations of audible, haptic and visual. After block 1818, control passes to decision block 1820.

Referring again to FIG. 28A, at block 1820, the processor determines if the CA has manually elected to switch from the first to the second modes. In some cases even if the ASR caption error rate is high, a CA may still opt to have the ASR continue generating initial captions for the CA to correct. At block 1820, if the CA elects not to switch to the second mode, control passes back to block 1802 where the process described above continues. If the CA elects to switch to the full CA caption and correction mode, control passes to block 1822 where the HU voice signal is broadcast to the CA for revoicing to trained voice to text software or for typing captions corresponding thereto which are presented on the CA display screen for CA error correction.

In addition, in at least some cases when the CA elects to switch to the full CA caption and correction mode, control also passes to block 1828 where the HU voice signal is still provided to the ASR engine to generate ASR text behind the scenes (e.g., for comparison to CA captions and corrections but not to be presented on the CA display or AU captioned device).

Referring again to FIG. 28A, at block 1824 the CA generates captions for the HU voice and also corrects errors in those captions. At block 1826 the CA captions are transmitted to the AU captioned device to be presented and error corrections are subsequently sent to the captioned device for in line or other error correction.

After ASR text is generated at block 1828, control passes to block 1830 where a system processor compares ASR captions to CA captions and error corrections to generate ASR accuracy metrics. At block 1832, the processor compares the ASR accuracy metrics to threshold accuracy metrics (e.g., 5% error rate over 30 seconds). Where the ASR quality metrics do not exceed the threshold metrics, control passes back up to blocks 1822 and 1828 where the process described above persists.

At block 1832, once the ASR accuracy metrics exceed the threshold accuracy metrics, control passes to block 1834 where the processor presents an alert, notification, or other indicator as a suggestion to the CA that the CA should at least consider switching back to the ASR captioning and CA correcting operating mode. At block 1836, if the CA switches back to the first operating mode, control passes back up to block 1802 as illustrated where the process described here continues to cycle. If the CA does not switch back to the first operating mode, control passes back up to block 1820 where the process that starts at block 1820 in FIG. 28A continually cycles.

Thus, in the FIG. 28A system there are two general operating modes including an ASR captioning and CA correcting mode and a full CA captioning and correcting mode where a system processor assesses ASR caption accuracy and provides guidance to a CA as to which mode is likely best given ASR caption accuracy and the CA manually switches between the two modes when deemed appropriate. In some cases interface icons or other input means for switching between the two modes may not be provided to the CA until the threshold accuracy metrics are met that trigger the suggestion to change.

In at least some embodiments the system may persistently provide interface icons or other input means enabling a CA to manually switch between the first and second operating modes at any time deemed appropriate by the CA. In this case, for instance, even where ASR captions are relatively accurate and do not exceed the threshold level at block 1814, the CA may opt for the full CA captioning and correcting operating mode.

In some cases it is contemplated that there may be two or more accuracy threshold levels and the system processor may operate differently to encourage a captioning mode change or to automatically cause a mode change based on which threshold is met. For instance, assume a case where a first ASR accuracy threshold is 10% (e.g., 10% of all captioned words are erroneous) and a second accuracy threshold is 15% (e.g., 15% of all captioned words are erroneous). Here, a processor may present a notification to a CA suggesting a change from the first mode to the second once the first threshold has been met for some duration of time (e.g., a time factor of 30 seconds). If the error rate exceeds the second threshold for a second time duration (e.g., 20 seconds), the processor may automatically initiate a change to the second operating mode. Thus, here, when the first threshold is met the CA is only encouraged to switch to the second mode but when the second threshold is met the system automatically initiates the second mode irrespective of the CA's desire. A similar 2 threshold triage process may be implemented when moving from the second operating mode to the first operating mode.
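
A minimal sketch of this two threshold behavior is shown below, assuming the example values from the preceding paragraph (a 10% rate held for 30 seconds yields a suggestion, a 15% rate held for 20 seconds forces the change); the class and method names are illustrative.

    import time

    class TwoThresholdTriage:
        def __init__(self, suggest_rate=0.10, auto_rate=0.15,
                     suggest_secs=30.0, auto_secs=20.0):
            self.suggest_rate, self.auto_rate = suggest_rate, auto_rate
            self.suggest_secs, self.auto_secs = suggest_secs, auto_secs
            self._suggest_since = None   # time the suggestion threshold was first met
            self._auto_since = None      # time the automatic threshold was first met

        def update(self, error_rate, now=None):
            now = time.time() if now is None else now
            self._suggest_since = (self._suggest_since or now) if error_rate >= self.suggest_rate else None
            self._auto_since = (self._auto_since or now) if error_rate >= self.auto_rate else None
            if self._auto_since is not None and now - self._auto_since >= self.auto_secs:
                return "switch_automatically"
            if self._suggest_since is not None and now - self._suggest_since >= self.suggest_secs:
                return "suggest_switch"
            return "no_action"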

At most times ASR captioning will be faster and more real time than CA captioning and therefore, it will usually be advantageous to transmit ASR captions to an AU immediately in any system. In any of the above cases (e.g., two-mode or other configurations), ASR text may always be sent immediately upon generation to an AU captioned device for display followed by several rounds of error correction based on the instantaneously best caption information available from either an ASR or a CA. For example, where an ASR continues to generate ASR captions and automatically contextually correct ASR generated captions in parallel with a CA independently generating CA captions and manually correcting, the sequence of initial text to an AU and corrections may include (i) transmitting ASR text to the AU captioned device for display, (ii) transmitting ASR error corrections based on context to the AU captioned device for a first round of error corrections (assuming these corrections occur prior to CA error corrections (see (iv) hereafter)), (iii) transmitting CA generated (e.g., captioned from CA revoicing) captions or differences between those captions and the ASR captions to the AU captioned device for a second round of error corrections, and (iv) transmitting CA error corrections to the AU captioned device to drive a third round of error corrections at the AU device.
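
One way to picture this layered approach is as an ordered stream of messages to the AU captioned device, with the initial ASR text followed by up to three correction rounds. The message format in the sketch below is purely hypothetical and is included only to make the sequence concrete.

    def caption_message_stream(asr_text, asr_context_fixes, ca_captions, ca_fixes):
        # Initial captions are sent as soon as the ASR produces them.
        yield {"type": "initial_captions", "source": "ASR", "text": asr_text}
        # Round 1: ASR contextual corrections (when they arrive before CA corrections).
        yield {"type": "correction", "round": 1, "source": "ASR context", "changes": asr_context_fixes}
        # Round 2: CA generated captions, or their differences from the ASR captions.
        yield {"type": "correction", "round": 2, "source": "CA captions", "changes": ca_captions}
        # Round 3: manual CA error corrections.
        yield {"type": "correction", "round": 3, "source": "CA corrections", "changes": ca_fixes}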

Referring now to FIG. 28B, a flow chart 1840 illustrating a captioning system with three rounds of error corrections is illustrated. At the beginning of the illustrated process, HU voice signal is simultaneously provided to an ASR for ASR captioning at block 1854 and to a CA at a relay at block 1842. The ASR engine generates ASR captions corresponding to the HU voice signal which are, upon generation, immediately transmitted to an AU captioned device at 1856 for display to an AU at the receiving device. At block 1858 the ASR engine continually uses perceived context in the HU voice signal received to automatically identify likely errors in the ASR captions and at block 1860, corrections for perceived errors are sent to the AU captioned device for in line or other error correction to the ASR text previously presented to the AU (e.g., a first round of error correction).

Referring again to FIG. 28B, at block 1842 the CA receives the HU voice signal and at block 1844 the CA revoices or otherwise acts to generate CA captions corresponding to the HU voice signal. At block 1846 a system processor receives the CA captions, the ASR captions and the ASR caption corrections and determines if the CA captions match the ASR captions and corrections. In at least some cases CA captions are taken as truth and therefore any mis-matching between CA and ASR captions or corrections is identified as an error in the ASR captions or corrections. At block 1848 CA captions that do not match the ASR captions or corrections are transmitted as error corrections to the AU captioned device for the second round of in line or other type of error correction.

Referring still to FIG. 28B, at block 1850 the CA reviews the CA generated text on a display screen and makes manual error corrections to that text which are transmitted to the AU captioned device to drive the third round of error corrections on the display presented to the AU.

As described above, it has been recognized that at least some ASR engines are more accurate and more resilient during the first 30 +/− seconds of performing voice to text transcription. If an HU takes a speaking turn that is longer than 30 seconds the engine has a tendency to freeze or lag. To deal with this issue, in at least some embodiments, all of an HU's speech or voice signal may be fed into an audio buffer and a system processor may examine the HU voice signal to identify any silent periods that exceed some threshold duration (e.g., 2 seconds). Here, a silent period would be detected whenever the HU voice signal audio is out of a range associated with a typical human voice. When a silent period is identified, in at least some cases the ASR engine is restarted and a new ASR session is created. Here, because the process uses an audio buffer, no portion of the HU's speech or voice signal is lost and the system can simply restart the ASR engine after the identified silent period and continue the captioning process after removing the silent period.

Because the ASR engine is restarted whenever a silent period of at least a threshold duration occurs, the system can be designed to have several advantageous features. First, the system can implement a dynamic and configurable silence or gap threshold. For instance, in some cases, the system processor monitoring for a silent period of a certain threshold duration can initially seek a period that exceeds some optimal relatively long length and can reduce the length of the threshold duration as the ASR captioning process nears a maximum period prior to restarting the engine. Thus, for instance, where a maximum ASR engine captioning period is 30 seconds, initially the silent period threshold duration may be 3 seconds. However, after an initial 20 seconds of captioning by an engine, the duration may be reduced to 1.5 seconds. Similarly, after 25 seconds of engine captioning, the threshold duration may be reduced further to one half a second.
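
The shrinking gap threshold can be expressed as a simple schedule keyed to how long the current ASR session has been running. The sketch below uses the example numbers from the preceding paragraph (3 seconds, then 1.5 seconds after 20 seconds of captioning, then 0.5 seconds after 25 seconds); the function names and the assumed session timing inputs are illustrative.

    def gap_threshold(session_elapsed_secs):
        # Silence duration (seconds) treated as a restart point, shrinking
        # as the session nears an assumed 30 second maximum length.
        if session_elapsed_secs < 20.0:
            return 3.0
        if session_elapsed_secs < 25.0:
            return 1.5
        return 0.5

    def find_restart_point(silence_gaps):
        # silence_gaps: list of (start_sec, duration_sec) silent periods
        # detected in the buffered HU audio for the current session.
        for start, duration in silence_gaps:
            if duration >= gap_threshold(start):
                return start
        return None   # no natural gap found; a gap may have to be manufactured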

As another instance, because the system uses an audio buffer in this case, the system can “manufacture” a gap or silent period in which to restart an ASR engine, holding an HU's voice signal in the audio buffer until the ASR engine starts captioning anew. While the manufactured silent period is not as desirable as identifying a natural gap or silent period as described above, the manufactured gap is a viable option if necessary so that the ASR engine can be restarted without loss of HU voice signal.

In some cases it is contemplated that a hybrid silent period approach may be implemented. Here, for instance, a system processor may monitor for a silent period that exceeds 3 seconds in which to restart an ASR engine. If the processor does not identify a suitable 3-plus second period for restarting the engine within 25 seconds, the processor may wait until the end of any word and manufacture a 3 second period in which to restart the engine.

Where a silent period longer than the threshold duration occurs and the ASR engine is restarted, if the engine is ready for captioning prior to the end of the threshold duration, the processor can take out the end of the silent period and begin feeding the HU voice signal to the ASR engine prior to the end of the threshold period. In this way, the processor can effectively eliminate most of the silent period so that captioning proceeds quickly.

Restarting an ASR engine at various points within an HU voice signal has the additional benefit of making all hypothesis words (e.g., initially identified words prior to contextual correction based on subsequent words) firm in at least some embodiments. Doing so allows a CA correcting the text to make corrections or any other manipulations deemed appropriate for an AU immediately without having to wait for automated contextual corrections and avoids a case where a CA error correction may be replaced subsequently by an ASR engine correction.

In still other cases other hybrid systems are contemplated where a processor examines an HU voice signal for suitably long silent periods in which to restart an ASR engine and, where no such period occurs by a certain point in a captioning process, the processor commences another ASR engine captioning process which overlaps the first process so that no HU voice signal is lost. Here, the processor would work out which captioned words are ultimately used as final ASR output during the overlapping periods to avoid duplicative or repeated text.

Return on Audio Detector Feature

One other feature that may be implemented in some embodiments of this disclosure is referred to as a Return On Audio detector (ROA-Detector) feature. In this regard, a system processor receiving an HU voice signal ascertains whether or not the signal includes audio in a range that is typical for human speech during an HU turn and generates a duration of speech value equal to the number of seconds of speech received. Thus, for instance, in a ten second period corresponding to an HU voice signal turn, there may be 3 seconds of silence during which audio is not in the range of typical human speech and therefore the duration of speech value would be 7 seconds. In addition, the processor detects the quantity of captions being generated by an ASR engine. The processor automatically compares the quantity of captions from the ASR with the duration of speech value to ascertain if there is a problem with the ASR engine. Thus, for instance, if the quantity of ASR generated captions is substantially less than would be expected given the duration of speech value, a potential ASR problem may be identified. The idea here is that if the duration of speech value is low (e.g., 4 out of 10 seconds) while the caption quality value (based on CA error corrections or some other factor(s)) is also low, the low caption quality value is likely not associated with the quantity of speech signal to be captioned and instead is likely associated with an ASR problem. Where an ASR problem is likely, the likely problem may be used by the processor to trigger a restart of the ASR engine to generate a better result. As an alternative, where an ASR problem is likely, the problem may trigger initiation of a whole new ASR session. As still one other alternative, a likely ASR problem may trigger a process to bring a CA on line immediately or more quickly than would otherwise be the case.
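
A minimal sketch of the ROA comparison follows, assuming a nominal speaking rate and a tolerance factor; both numbers, and the function name, are illustrative assumptions rather than required values.

    def roa_problem_suspected(speech_secs, caption_word_count,
                              assumed_wpm=150.0, tolerance=0.5):
        # Flag a likely ASR problem when the caption output is well below
        # what the duration of speech value would lead us to expect.
        if speech_secs <= 0:
            return False
        expected_words = assumed_wpm * (speech_secs / 60.0)
        return caption_word_count < tolerance * expected_words

For example, with the assumed numbers, 7 seconds of detected speech that produced only a few captioned words would be flagged, which could in turn trigger an engine restart, a new ASR session, or earlier CA involvement as described above.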

In still other cases, when a likely ASR error is detected as indicatedabove, the ROA detector may retrieve the audio (i.e., the HU voicesignal) that was originally sent to the ASR from a rolling buffer andreplay/resend the audio to the ASR engine. This replayed audio would besent through a separate session simultaneously with any new sessionsthat are sending ongoing audio to the ASR. Here, the captionscorresponding to the replayed audio would be sent to the AU device andinserted into a correct sequential slot in the captions presented to theAU. In addition, here, the ROA detector would monitor the text thatcomes back from the ASR and compare that text to the text retrievedduring the prior session, modifying the captions to remove redundancies.Another option would be for the ROA to simply deliver a message to theAU device indicating that there was an error and that a segment of audiowas likely not properly captioned. Here, the AU device would present thelikely erroneous captions in some way that indicates a likely error(e.g., perhaps visually distinguished by a yellow highlight or thelike).

In some cases it is contemplated that a phone user may want to have justin time (JIT) captions on their phone or other communication device(e.g., a tablet) during a call with an HU for some reason. For instance,when a smart phone user wants to remove a smart phone from her ear for ashort period the user may want to have text corresponding to an HU'svoice presented during that period. Here, it is contemplated that avirtual “Text” or “Caption” button may be presented on the smart phonedisplay screen or a mechanical button may be presented on the devicewhich, when selected causes an ASR to generate text for a preset periodof time (e.g. 10 seconds) or until turned off by the device user. Here,the ASR may be on the smart phone device itself, may be at a relay or atsome other device (e.g., the HU's device). In other cases where a smartphone includes a motion sensor device or other sensor that can detectwhen a user moves the device away from her ear or when the user looks atthe device (e.g., a face recognition or eye gaze sensor), the system mayautomatically present text to the AU upon a specific motion (e.g.,pulling away from the user's ear) or upon recognizing that the user islikely looking at a display screen on the AU's device.

While HU voice profiles may be developed and stored for any HU callingan AU, in some embodiments, profiles may only be stored for a small setof HUs, such as, for instance, a set of favorites or contacts of an AU.For instance, where an AU has a list of ten favorites, HU voice profilesmay be developed, maintained, and morphed over time for each of thosefavorites. Here, again, the profiles may be stored at differentlocations and by different devices including the AU device, a relay, viaa third party service provider, or even an HU device where the HUearmarks certain AUs as having the HU as a favorite or a contact.

In some cases it may be difficult technologically for a CA to correctASR captions. Here, instead of a CA correcting captions, another optionwould simply be for a CA to mark errors in ASR text as wrong and movealong. Here, the error could be indicated to an AU via the display on anAU's device. In addition, the error could be used to train an HU voiceprofile and/or captioning model as described above. As anotheralternative, where a CA marks a word wrong, a correction engine maygenerate and present a list of alternative words for the CA to choosefrom. Here, using an on screen tool, the CA may select a correct wordoption causing the correction to be presented to an AU as well ascausing the ASR to train to the corrected word.

Metrics—Tracking and Reporting CA and ASR Accuracy

In at least some cases it is contemplated that it may be useful to runperiodic tests on CA generated text captions to track CA accuracy orreliability over time. For instance, in some cases CA reliabilitytesting can be used to determine when a particular CA could useadditional or specialized training. In other cases, CA reliabilitytesting may be useful for determining when to cut a CA out of a call tobe replaced by automatic speech recognition (ASR) generated text. Inthis regard, for instance, if a CA is less reliable than an ASRapplication for at least some threshold period of time, a systemprocessor may automatically cut the CA out even if ASR quality remainsbelow some threshold target quality level if the ASR quality ispersistently above the quality of CA generated text. As anotherinstance, where CA quality is low, text from the CA may be fed to asecond CA for either a first or second round of corrections prior totransmission to an AU device for display or, a second relatively moreskilled CA trained in handling difficult HU voice signals may be swappedinto the transcription process in order to increase the quality level ofthe transcribed text. As still one other instance, CA reliabilitytesting may be useful to a governing agency interested in tracking CAaccuracy for some reason.

In at least some cases it has been recognized that in addition toassessing CA captioning quality, it will be useful to assess howaccurately an automated speech recognition system can caption the sameHU voice signal regardless of whether or not the quality values are usedto switch the method of captioning. For instance, in at least some casesline noise or other signal parameters may affect the quality of HU voicesignal received at a relay and therefore, a low CA captioning qualitymay be at least in part attributed to line noise and other signalprocessing issues. In this case, an ASR quality value for ASR generatedtext corresponding to the HU voice signal may be used as an indicationof other parameters that affect CA captioning quality and therefore inpart as a reason or justification for a low CA quality value. Forinstance, where an ASR quality value is 75% out of 100% and a CA qualityvalue is 87% out of 100%, the low ASR quality value may be used to showthat, in fact, given the relatively higher CA quality value, that the CAvalue is quite good despite being below a minimum target threshold. Linenoise and other parameters may be measured in more direct ways via linesensors at a relay or elsewhere in the system and parameter valuesindicative of line noise and other characteristics may be stored alongwith CA quality values to consider when assessing CA caption quality.

Several ways to test CA accuracy and generate accuracy statistics arecontemplated by the present disclosure. One system for testing andtracking accuracy may include a system where actual or simulated HU-AUcalls are recorded for subsequent testing purposes and where HU turns(e.g., voice signal periods) in each call are transcribed and correctedby a CA to generate a true and highly accurate (e.g., approximately 100%accurate) transcription of the HU turns that is referred to hereinafteras the “truth”. Here, metrics on the HU voice message speed, dynamicduration of speech value, complexity of voice message words, quality ofvoice message signal, voice message pitch, tone, etc., can all bepredetermined and used to assess CA accuracy as well as to identifyspecific call types with specific characteristics that a CA does bestwith and others that the assistant has relatively greater difficultyhandling.

During testing, without a CA knowing that a test is being performed, thetest recording is presented to the CA as a new AU-HU call for captioningand the CA perceives the recording to be a typical HU-AU call. In manycases, a large number of recorded calls may be generated and stored foruse by the testing system so that a CA never listens to the same testrecording more than once. In some cases a system processor may track CAsand which test recordings the CA has been exposed to previously and mayensure that a CA only listens to any test recording once.

As a CA listens to a test recording, the CA transcribes the HU voice signal to text and, in at least some cases, makes corrections to the text. Because the CA generated text corresponds to a recorded voice signal and not a real time signal, the text is not forwarded to an AU device for display. The CA is unaware that the text is not forwarded to the AU device as this exercise is a test. The CA generated text is compared to the truth and a quality value is generated for the CA generated text (hereinafter a “CA quality value”). For instance, the CA quality value may be a percent accuracy representing the percent of HU voice signal words accurately transcribed to text. The CA quality value may also be affected by other factors like speed of the voice message, dynamic duration of speech value, complexity of voice message words, quality of voice message signal, voice message pitch, tone, etc.
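
As one possible formulation of the percent accuracy, the same word level alignment approach sketched earlier for comparison unit 930 could be reused against the truth transcript, as in the illustrative sketch below (the alignment method and function name are assumptions, not a prescribed scoring rule).

    import difflib

    def ca_quality_value(truth_text, ca_text):
        # Percent of truth words that the CA captured, in order.
        truth_words, ca_words = truth_text.split(), ca_text.split()
        matcher = difflib.SequenceMatcher(a=truth_words, b=ca_words)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        return 100.0 * matched / max(len(truth_words), 1)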

In at least some cases different CA quality values may be generated fora single CA where each value is associated with a different subset ofvoice message and captioning characteristics. For instance, in a simplecase, a first CA may have a high caption quality value associated withhigh pitch voices and a relatively lower caption quality valueassociated with low pitch voices. The same first CA may have arelatively high caption quality value for high pitched voices where aduration of speech value is relatively low (e.g., less than 50%) whencompared to the quality value for a high pitched voice where theduration of speech value is relatively high (e.g., greater than 50%).Many other voice message characteristic subsets for qualifying captionquality values are contemplated.

The multiple caption quality values can be used to identify specific call types with specific characteristics that a CA does best with and others that the assistant has relatively greater difficulty handling. Incoming calls can be routed to CAs that are optimized (e.g., available and highly effective for calls with specific characteristics) to handle those calls. CA caption quality values and associated voice message characteristics are stored in a database for subsequent access.

In addition to generating one or more CA quality values that representhow accurately a CA transcribes voice to text, in at least some casesthe system will be programmed to track and record transcription latencythat can be used as a second type of quality factor referred tohereinafter as the “CA latency value”. Here, the system may trackinstantaneous latency and use the instantaneous values to generateaverage and other statistical latency values. For instance, an averagelatency over an entire call may be calculated, an average latency over amost recent one minute period may be calculated, a maximum latencyduring a call, a minimum latency during a call, a latency average takingout the most latent 20% and least latent 20% of a call may be calculatedand stored, etc. In some cases where both a CA quality value and CAlatency values are generated, the system may combine the quality andlatency values according to some algorithm to generate an overall CAservice value that reflects the combination of accuracy and latency.

CA latency may also be calculated in other ways. For instance, in at least some cases a relay server may be programmed to count the number of words during a period that are received from an ASR service provider (see 1006 in FIG. 30) and to assume that the returned number of words over a minute duration represents the actual words per minute (WPM) spoken by an HU. Here, periods of HU silence may be removed from the period so that the word count more accurately reflects WPM of the speaking HU. Then, the number of words generated by a CA for the same period may be counted and used along with the period duration minus silent periods to determine a CA WPM count. The server may then compare the HU's WPM to the CA WPM count to assess CA delay or latency.
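
The WPM comparison can be reduced to a short calculation, sketched below with illustrative names; the ASR word count stands in for the HU's actual speaking rate as described above.

    def words_per_minute(word_count, period_secs, silent_secs):
        active_secs = max(period_secs - silent_secs, 1e-6)
        return word_count * 60.0 / active_secs

    def ca_lag_ratio(asr_word_count, ca_word_count, period_secs, silent_secs):
        # Ratio of the CA captioning rate to the (ASR approximated) HU
        # speaking rate; values well below 1.0 suggest the CA is falling behind.
        hu_wpm = words_per_minute(asr_word_count, period_secs, silent_secs)
        ca_wpm = words_per_minute(ca_word_count, period_secs, silent_secs)
        return ca_wpm / hu_wpm if hu_wpm > 0 else 1.0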

Where actual calls are used to generate CA metrics, in at least somecases call content is not persistently stored as either voice or textfor subsequent access. Instead, in these cases, only audio, caption andcorrection timing information (e.g., delay durations) is stored for eachcall. In other cases, in addition to the timing information, callcharacteristics (e.g., Hispanic voice, HU WPM rate, line signal quality,HU volume, tone, etc.) and/or error types (e.g., visible, invisible,minor, etc.) for each corrected and missed error may be stored.

Where pre-recorded test calls are used to generate CA metrics, in at least some cases in addition to storing the timing, call characteristics and error types for each call, the system may store the complete text, call audio record with time stamps, captioning record and corrections record so that a system administrator has the ability to go back and view captioning and correction for an entire call to gain insights related to CA strengths and weaknesses.

In at least some cases the recorded call may also be provided to an ASRto generate automatic text. The ASR generated text may also be comparedto the truth and an “ASR quality value” may be generated. The ASRquality value may be stored in a database for subsequent use or may becompared to the CA quality value to assess which quality value is higheror for some other purpose. Here, also, an ASR latency value or ASRlatency values (e.g., max, min, average over a call, average over a mostrecent period, etc.) may be generated as well as an overall ASR servicevalue. Again, the ASR and CA values may be used by a system processor todetermine when the ASR generated text should be swapped in for the CAgenerated text and vice versa.

Referring now to FIG. 29, an exemplary system 1000 for testing andtracking CA and ASR quality and latency values using pre-recorded HU-AUcalls is illustrated. System 1000 includes relay components representedby the phantom box at 1001 and a cloud based ASR system 1006 (e.g., aserver that is linked to via the internet or some other type ofcomputing network). Two sources of pre-generated information aremaintained at the relay including a set of recorded calls at 1002 and aset of verified true transcripts at 1010, one truth or true transcriptfor each recorded call in the set 1002. Again, the recorded calls mayinclude actual HU-AU calls or may include mock calls that occur betweentwo knowing parties that simulate an actual call.

During testing, a connection is linked from a system server that storesthe calls 1002 to a captioning platform as shown at 1004 and one of therecorded calls, hereinafter referred to as a test recording, istransmitted to the captioning platform 1004. The captioning platform1004 sends the received test recording to two targets including a CA at1008 and the ASR server 1006 (e.g., Google Voice, IBM's Watson, etc.).The ASR generates an automated text transcript that is forwarded on to afirst comparison engine at 1012. Similarly, the CA generates CAgenerated text which is forwarded on to a second comparison engine 1014.The verified truth text transcript at 1010 is provided to each of thefirst and second comparison engines 1012 and 1014. The first engine 1012compares the ASR text to the truth and generates an ASR quality valueand the second engine 1014 compares the CA generated text to truth andgenerates a CA quality value, each of which are provided to a systemdatabase 1016 for storage until subsequently required.

In addition, in some cases, some component within the system 1000 generates latency values for each of the ASR text and the CA generated text by comparing the times at which words are uttered in the HU voice signal to the times at which the text corresponding thereto is generated. The latency values are represented by clock symbols 1003 and 1005 in FIG. 29. The latency values are stored in the database 1016 along with the associated ASR and CA quality values generated by the comparison engines 1012 and 1014.
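
Where per word timing is available, the latency values represented by clock symbols 1003 and 1005 could be derived from pairs of utterance and caption timestamps, for example as in the following sketch (the field names and the 20% trimming rule follow the statistics discussed earlier in this disclosure and are otherwise assumptions).

    def latency_stats(word_times):
        # word_times: list of (spoken_at_sec, captioned_at_sec) pairs.
        delays = [captioned - spoken for spoken, captioned in word_times]
        if not delays:
            return {}
        k = len(delays) // 5            # drop the most and least latent 20%
        trimmed = sorted(delays)[k:len(delays) - k] or delays
        return {
            "average": sum(delays) / len(delays),
            "maximum": max(delays),
            "minimum": min(delays),
            "trimmed_average": sum(trimmed) / len(trimmed),
        }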

Another way to test CA quality contemplated by the present disclosure isto use real time HU-AU calls to generate quality and latency values. Inthese cases, a first CA may be assigned to an ongoing HU-AU call and mayoperate in a conventional fashion to generate transcribed text thatcorresponds to an HU voice signal where the transcribed text istransmitted back to the AU device for display substantiallysimultaneously as the HU voice is broadcast to the AU. Here, the firstCA may perform any process to convert the HU voice to text such as, forinstance, revoicing the HU voice signal to a processor that runs voiceto text software trained to the voice of the HU to generate text andthen correcting the text on a display screen prior to sending the textto the AU device for display. In addition, the CA generated text is alsoprovided to a second CA along with the HU voice signal and the second CAlistens to the HU voice signal and views the text generated by the firstCA and makes corrections to the first CA generated text. Having beencorrected a second time, the text generated by the second CA is asubstantially error free transcription of the HU voice signal referredto hereinafter as the “truth”. The truth and the first CA generated textare provided to a comparison engine which then generates a “CA qualityvalue” similar to the CA quality value described above with respect toFIG. 29 which is stored for subsequent access in a database.

In addition, as is the case in FIG. 29, in the case of transcribing anongoing HU-AU call, the HU voice signal may also be provided to a cloudbased ASR server or service to generate automated speech recognitiontext during an ongoing call that can be compared to the truth (e.g., thesecond CA generated text) to generate an ASR quality value. Here, whileconventional ASRs are fast, there will again be some latency in textgeneration and the system will be able to generate an ASR latency value.

Referring now to FIG. 30, an exemplary system 1020 for testing andtracking CA and ASR quality and latency values using ongoing HU-AU callsis illustrated. Components in the FIG. 30 system 1020 that are similarto the components described above with respect to FIG. 29 are labeledwith the same numbers and operate in a similar fashion unless indicatedotherwise hereafter. In addition to an HU communication device 1040 andan AU communication device 1042 (e.g., a caption type telephone device),system 1020 includes relay components represented by the phantom box at1021 and a cloud based ASR system 1006 akin to the cloud based systemdescribed above with respect to FIG. 29. Here there is no pre-generatedand recorded call or pre-generated truth text as testing is done usingan ongoing dynamic call. Instead, a second CA at 1030 corrects textgenerated by a first CA at 1008 to create a truth (e.g., essentially100% accurate text). The truth is compared to ASR generated text and thefirst CA generated text to create quality values to be stored indatabase 1016.

Referring still to FIG. 30, during testing, as in a conventional relayassisted captioning system, the AU device 1042 transmits an HU voicesignal to the captioning platform at 1004. The captioning platform 1004sends the received HU voice signal to two targets including a first CAat 1008 and the ASR server 1006 (e.g., Google Voice, IBM's Watson,etc.). The ASR generates an automated text transcript that is forwardedon to a first comparison engine at 1012. Similarly, the first CAgenerates CA generated text which is transmitted to at least threedifferent targets. First, the first CA generated text which may includetext corrected by the first CA is transmitted to the AU device 1042 fordisplay to the AU during the call. Second, the first CA generated textis transmitted to the second comparison engine 1014. Third, the first CAgenerated text is transmitted to a second CA at 1030. The second CA at1030 views the CA generated text on a display screen and also listens tothe HU voice signal and makes corrections to the first CA generated textwhere the second CA generated text operates as a truth text or truth.The truth is transmitted to the second comparison engine at 1014 to becompared to the first CA generated text so that a CA quality value canbe generated. The CA quality value is stored in database 1016 along withone or more CA latency values.

Referring again to FIG. 30, the truth is also transmitted from thesecond CA at 1030 to the first comparison engine at 1012 to be comparedto the ASR generated text so that an ASR quality value is generatedwhich is also stored along with at least one ASR latency value in thedatabase 1016.

Referring to FIG. 31, another embodiment of a testing relay system is shown at 1050 which is similar to the system 1020 of FIG. 30, albeit where the ASR service 1006 provides an initial text transcription to the second CA at 1052 instead of the CA receiving the initial text from the first CA. Here, the second CA generates the truth text which is again provided to the two comparison engines at 1012 and 1014 so that ASR and CA quality factors can be generated to be stored in database 1016.

The ASR text generation and quality testing processes are describedabove as occurring essentially in real time as a first CA generates textfor a recorded or ongoing call. Here, real time quality and latencytesting may be important where a dynamic triage transcription process isoccurring where, for instance, ASR generated text may be swapped in fora cut out CA when ASR generated text achieves some quality threshold ora CA may be swapped in for ASR generated text if the ASR quality valuedrops below some threshold level. In other cases, however, qualitytesting may not need to be real time and instead, may be able to be doneoff line for some purposes. For instance, where quality testing is onlyused to provide metrics to a government agency, the testing may be doneoff line.

In this regard, referring again to FIG. 29, in at least some cases where testing cannot be done on the fly as a CA at 1008 generates text, the CA text and the recorded HU voice signal associated therewith may be stored in database 1016 for subsequent access for generating the ASR text at 1006 as well as for comparing the CA generated text and the ASR generated text to the verified truth text from 1010. Similarly, referring again to FIG. 30, where real time quality and latency values are not required, at least the HU portion of a call may be stored in database 1016 for subsequent off line processing by ASR service 1006 and the second CA at 1030 and then for comparisons to the truth at engines 1012 and 1014.

It should be appreciated that currently there are Federal and state regulations that prohibit storage of any parts of voice communications between two or more people without authorization from at least one of those persons. For this reason, in at least some cases it is contemplated that real voice recordings of AU-HU calls may only be used for training purposes after authorization is sought and received. Here, the same recording may be used to train multiple CAs. In other cases, “fake” AU-HU call recordings may be generated and used for training purposes so that regulations and AU and HU privacy concerns are not violated. Here, true transcripts of the fake calls can be generated and stored for use in assessing CA caption quality. One advantage of fake call records is that different qualities of HU voice signals can be simulated automatically to see how those affect CA caption accuracy, speed, etc. For instance, a first CA may be much more accurate and faster than a second CA at captioning standard or poor definition or quality voice signals.

One advantage of generating quality and latency values in real timeusing real HU-AU calls is that there is no need to store calls forsubsequent processing. Currently there are regulations in at least somejurisdictions that prohibit storing calls for privacy reasons andtherefore off line quality testing cannot be done in these cases.

In at least some embodiments it is contemplated that quality and latency testing may only be performed sporadically and generally randomly so that generated values are an average representation of the overall captioning service. In other cases, while quality and latency testing may be periodic in general, it is contemplated that telltale signs of poor quality during transcription may be used to trigger additional quality and latency testing. For instance, in at least some cases where an AU is receiving ASR generated text and the AU selects an option to link to a CA for correction, the AU request may be used as a trigger to start the quality testing process on text received from that point on (e.g., quality testing will commence and continue for HU voice received as time progresses forward). Similarly, when an AU requests full CA captioning (e.g., revoicing and text correction), quality testing may be performed from that point forward on the CA generated text.

In other cases, it is contemplated that an HU-AU call may be storedduring the duration of the call and that, at least initially, no qualitytesting may occur. Then, if an AU requests CA assistance, in addition topatching a CA into the call to generate higher quality transcription,the system may automatically patch in a second CA that generates truthtext as in FIG. 30 for the remainder of the call. In addition orinstead, when the AU requests CA assistance, the system may, in additionto patching a CA in to generate better quality text, also cause therecorded HU voice prior to the request to be used by a second CA togenerate truth text for comparison to the ASR generated text so that anASR quality value for the text that caused the AU to request assistancecan be generated. Here, the pre-CA assistance ASR quality value may begenerated for the entire duration of the call prior to the request orjust for a most recent sub-period (e.g., for the prior minute or 30seconds). Here, in at least some cases, it is contemplated that thesystem may automatically erase any recorded portion of an HU-AU callimmediately after any quality values associated therewith have beencalculated. In cases where quality values are only calculated for a mostrecent period of HU voice signal, recordings prior thereto may be erasedon a rolling basis.

As another instance, in at least some cases it is contemplated thatsensors at a relay may sense line noise or other signal parameters and,whenever the line noise or other parameters meet some threshold level,the system may automatically start quality testing which may persistuntil the parameters no longer meet the threshold level. Here, there maybe hysteresis built into the system so that once a threshold is met, atleast some duration of HU voice signal below the threshold is requiredto halt the testing activities. The parameter value or condition orcircumstance that triggered the quality testing would, in this case, bestored along with the quality value and latency information to addcontext to why the system started quality testing in the specificinstance.

As one other example, in a case where an AU signals dissatisfaction witha captioning service at the end of a call, quality testing may beperformed on at least a portion of the call. To this end, in at leastsome cases as an HU-AU call progresses, the call may be recordedregardless of whether or not ASR or CA generated text is presented to anAU. Then, at the end of a call, a query may be presented to the AUrequesting that the AU rate the AU's satisfaction with the call andcaptioning on some scale (e.g., a 1 through 10 quality scale with 10being high). Here, if a satisfaction rating were low (e.g., less than 7)for some reason, the system may automatically use the recorded HU voiceor at least a portion thereof to generate a CA quality value in one ofthe ways described above. For instance, the system may provide the textgenerated by a first CA or by the ASR and the recorded HU voice signalto a second CA for generating truth and a quality value may be generatedusing the truth text for storage in the database.

In still other cases where an AU expresses a low satisfaction rating fora captioning service, prior to using a recorded HU voice signal togenerate a quality value, the system server may request authorization touse the signal to generate a captioning quality value. For instance,after an AU indicates a 7 (out of 10) or lower on a satisfaction scale,the system may query the AU for authorization to check captioningquality by providing a query on the AU's device display and “Yes” and“No” options. Here, if the yes option is selected, the system wouldgenerate the captioning quality value for the call and memorialize thatvalue in the system database 1016. In addition, if the system identifiessome likely factor in a low quality assessment, the system maymemorialize that factor and present some type of feedback indicating thefactor as a likely reason for the low quality value. For instance, ifthe system determines that the AU-HU link was extremely noisy, thatfactor may be memorialized and indicated to the AU as a reason for thepoor quality captioning service.

As another instance, because it is the HU's voice signal that isrecorded (e.g., in some cases the AU voice signal may not be recorded)and used to generate the captioning quality value, authorization to usethe recording to generate the quality value may be sought from an HU ifthe HU is using a device that can receive and issue an authorizationrequest at the end of a call. For instance, in the case of a call wherean HU uses a standard telephone, if an AU indicates a low satisfactionrating at the end of a call, the system may transmit an audio recordingto the HU requesting authorization to use the HU voice signal togenerate the quality value along with instructions to select “1” for yesand “2” for no. In other cases where an HU's device is a smart phone orother computing type device, the request may include text transmitted tothe HU device and selectable “Yes” and “No” buttons for authorizing ornot.

While an HU-AU call recording may be at least temporarily stored at arelay, in other cases it is contemplated that call recordings may bestored at an AU device or even at an HU device until needed to generatequality values. In this way, an HU or AU may exercise more control or atleast perceive to exercise more control over call content. Here, forinstance, while a call may be recorded, the recording device may notrelease recordings unless authorization to do so is received from adevice operator (e.g., an HU or an AU). Thus, for instance, if the HUvoice signal for a call is stored on an HU device during the call and,at the end of a call an AU expresses low satisfaction with thecaptioning service in response to a satisfaction query, the system mayquery the HU to authorize use of the HU voice to generate captioningquality values. In this case, if the HU authorizes use of the HU voicesignal, the recorded HU voice signal would be transmitted to the relayto be used to generate captioning quality values as described above.Thus, the HU or AU device may serve as a sort of software vault for HUvoice signal recordings that are only released to the relay after properauthorization is received from the HU or the AU, depending on systemrequirements.

As generally known in the industry, voice to text software accuracy is higher for software that is trained to the voice of a speaking person. Also known is that software can train to specific voices over short durations. Nevertheless, in most cases it is advantageous if software starts with a voice model trained to a particular voice so that caption accuracy can start immediately upon transcription. Thus, for instance, in FIG. 30, when a specific HU calls an AU to converse, it would be advantageous if the ASR service at 1006 had access to a voice model for the specific HU. One way to do this would be to have the ASR service 1006 store voice models for at least HUs that routinely call an AU (e.g., a top ten HU list for each AU) and, when an HU voice signal is received at the ASR service, the service would identify the HU voice signal either using recognition software that can distinguish one voice from others or via some type of an identifier like the phone number of the HU device used to call the AU. Once the HU voice is identified, the ASR service accesses an HU voice model associated with the HU voice and uses that model to perform automated captioning.

One problem with systems that require an ASR service to store HU voice models is that HUs may prefer to not have their voice models stored by third party ASR service providers or at least to not have the models stored and associated with specific HUs. Another problem may be that regulatory agencies may not allow a third party ASR service provider to maintain HU voice models or at least models that are associated with specific HUs. One solution is that no information useable to associate an HU with a voice model may be stored by an ASR service provider. Here, instead of using an HU identifier like a phone number or other network address associated with an HU's device to identify an HU, an ASR server may be programmed to identify an HU's voice signal from analysis of the voice signal itself in an anonymous way. It is contemplated that voice models may be developed for every HU that calls an AU and may be stored in the cloud by the ASR service provider. Even in cases where there are thousands of stored voice models, an HU's specific model should be quickly identifiable by a processor or server.

Another solution may be for an AU device to store HU voice models for frequent callers where each model is associated with an HU identifier like a phone number or network address associated with a specific HU device. Here, when a call is received at an AU device, the AU device processor may use the number or address associated with the HU device to identify which voice model to associate with the HU device. Then, the AU device may forward the HU voice model to the ASR service provider 1006 to be used temporarily during the call to generate ASR text. Similarly, instead of forwarding an HU voice model to the ASR service provider, the AU device may simply forward an intermediate identification number or other identifier associated with the HU device to the ASR provider and the provider may associate the number with a specific HU voice model stored by the provider to access an appropriate HU voice model to use for text transcription. Here, for instance, where an AU supports ten different HU voice models for 10 most recent HU callers, the models may be associated with numbers 1 through 10 and the AU may simply forward on one of the intermediate identifiers (e.g., “7”) to the ASR provider 1006 to indicate which one of ten voice models maintained by the ASR provider for the AU to use with the HU voice transmitted.
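
A hypothetical sketch of the intermediate identifier scheme is shown below: the AU device keeps a small table mapping HU phone numbers to slot numbers, only the slot number is sent to the ASR provider, and the provider maps that slot back to a voice model it holds for the AU. All of the names, numbers and data structures are illustrative assumptions.

    AU_DEVICE_CONTACTS = {"+15551234567": 7, "+15559876543": 2}   # HU number -> slot

    ASR_PROVIDER_MODELS = {   # (AU account, slot) -> opaque voice model handle
        ("au-account-001", 7): "voice-model-slot-7",
        ("au-account-001", 2): "voice-model-slot-2",
    }

    def slot_for_incoming_call(hu_number):
        # Runs on the AU device: choose the intermediate identifier to forward.
        return AU_DEVICE_CONTACTS.get(hu_number)    # None -> no stored model

    def model_for_slot(au_account, slot):
        # Runs at the ASR provider: map the forwarded slot to a voice model.
        return ASR_PROVIDER_MODELS.get((au_account, slot))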

In other cases an ASR may develop and store voice models for each HU that calls a specific AU in a fashion that correlates those models with the AU's identity. Then when the ASR provider receives a call from an AU caption device, the ASR provider may identify the AU and associated HU voice models and use those models to identify the HU on the call and the model associated therewith.

In still other cases an HU device may maintain one or more HU voicemodels that can be forwarded on to an ASR provider either through therelay or directly to generate text.

Visible and Invisible Voice to Text Errors

In at least some cases other more complex quality analysis and statistics are contemplated that may be useful in determining better ways to train CAs as well as in assessing CA quality values. For instance, it has been recognized that voice to text errors can generally be split into two different categories referred to herein as “visible” and “invisible” errors. Visible errors are errors that result in text that, upon reading, is clearly erroneous while invisible errors are errors that result in text that, despite the error that occurred, makes sense in context. For instance, where an HU voices the phrase “We are meeting at Joe's restaurant for pizza at 9 PM”, in a text transcription “We are meeting at Joe's rodent for pizza at 9 PM”, the word “rodent” is a “visible” error in the sense that an AU reading the phrase would quickly understand that the word “rodent” makes no sense in context. On the other hand, if the HU's phrase were transcribed as “We are meeting at Joe's room for pizza at 9 PM”, the erroneous word “room” is not contextually wrong and therefore cannot be easily discerned as an error. Where the word “restaurant” is erroneously transcribed as “room”, an AU could easily get a wrong impression and for that reason invisible errors are generally considered worse than visible errors.

In at least some cases it is contemplated that some mechanism for distinguishing visible and invisible text transcription errors may be included in a relay quality testing system. For instance, where 10 errors are made during some sub-period of an HU-AU call, three of the errors may be identified as invisible while seven are visible. Here, because invisible errors typically have a worse effect on communication effectiveness, statistics that capture relative numbers of invisible errors to all errors should be useful in assessing CA or ASR quality.

In at least some systems it is contemplated that a relay server may be programmed to automatically identify at least visible errors so that statistics related thereto can be captured. For instance, the server may be able to contextually examine text and identify words or phrases that simply make no sense and may identify each of those nonsensical errors as a visible error. Here, because invisible errors make contextual sense, there is no easy algorithm by which a processor or server can identify invisible errors. For this reason, in at least some cases a correcting CA (see 1053 in FIG. 31) may be required to identify invisible errors or, in the alternative, the system may be programmed to automatically use CA corrections to identify invisible errors. In this regard, any time a CA changes a word in a text phrase that initially made sense within the phrase to another word that contextually makes sense in the phrase, the system may recognize that type of correction to have been associated with an invisible error.

In at least some cases it is contemplated that the decision to switch captioning methods may be tied at least in part to the types of errors identified during a call. For instance, assume that a CA is currently generating text corresponding to an HU voice signal and that an ASR is currently training to the HU voice signal but is not currently at a high enough quality threshold to cut out the CA transcription process. Here, there may be one threshold for the CA quality value generally and another for the CA invisible error rate where, if either of the two thresholds is met, the system automatically cuts the CA out. For example, the threshold CA quality value may require 95% accuracy and the CA invisible error rate may be 20% coupled with a 90% overall accuracy requirement. Thus, here, if the invisible error rate amounts to 20% or less of all errors and the overall CA text accuracy is above 90% (e.g., the invisible error rate is less than 2% of all words uttered by the HU), the CA may be cut out of the call and ASR text relied upon for captioning. Other error types are contemplated, as is a system for distinguishing each of several error types from one another for statistical reporting and for driving the captioning triage process.
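
The following Python sketch illustrates one possible form of the triage test just described. The 95%, 90% and 20% figures are the example thresholds from the preceding paragraph, and the function name and inputs are hypothetical.

```python
# Minimal sketch: cut the CA out of the call when either the overall accuracy
# alone clears a high bar, or a lower accuracy bar is met together with a
# sufficiently small share of invisible errors.

def cut_ca_out(total_words, total_errors, invisible_errors,
               accuracy_only_threshold=0.95,
               combined_accuracy_threshold=0.90,
               invisible_share_threshold=0.20):
    accuracy = 1.0 - (total_errors / total_words)
    invisible_share = (invisible_errors / total_errors) if total_errors else 0.0
    if accuracy >= accuracy_only_threshold:
        return True
    return (accuracy >= combined_accuracy_threshold
            and invisible_share <= invisible_share_threshold)

# 1000 words, 60 errors (94% accurate), 10 of them invisible (about 17% of errors)
print(cut_ca_out(1000, 60, 10))   # True under the combined test
```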

In at least some cases, when to transition from CA generated text to ASR generated text may be a function of more than a straight comparison of ASR and CA quality values and instead may be related to both quality and the relative latency associated with different transcription methods. In addition, when to transition may in some cases be related to a combination of quality values, error types and relative latency as well as to user preferences.

Other triage processes for identifying which HU voice to text method should be used are contemplated. For instance, in at least some embodiments, when an ASR service or ASR software at a relay is being used to generate and transmit text to an AU device for display, if an ASR quality value drops below some threshold level, a CA may be patched in to the call in an attempt to increase quality of the transcribed text. Here, the CA may either be a full revoicing and correcting CA, just a correcting CA that starts with the ASR generated text and makes corrections, or a first CA that revoices and a second CA that makes corrections. In a case where a correcting CA is brought into a call, in at least some cases the ASR generated text may be provided to the AU device for display at the same time that the ASR generated text is sent to the CA for correction. In that case, corrected text may be transmitted to the AU device for in line correction once generated by the CA. In addition, the system may track quality of the CA corrected text and store a CA quality value in a system database.

In other cases, when a CA is brought into a call, text may not be transmitted to the AU device until the CA has corrected that text and then the corrected text may be transmitted.

In some cases, when a CA is linked to a call because the ASR generated text was not of a sufficiently high quality, the CA may simply start correcting text related to HU voice signal received after the CA is linked to the call. In other cases the CA may be presented with text associated with HU voice signal that was transcribed prior to the CA being linked to the call so that the CA can make corrections to that text, and then the CA may continue to make corrections to the text as subsequent HU voice signal is received.

Thus, as described above, in at least some embodiments an HU's communication device will include a display screen and a processor that drives the display screen to present a quality indication of the captions being presented to an AU. Here, the quality characteristic may include some accuracy percentage, the actual text being presented to the AU, or some other suitable indication of caption accuracy or an accuracy estimation. In addition, the HU device may present one or more options for upgrading the captioning quality such as, for instance, requesting CA correction of automated text captioning, requesting CA transcription and correction, etc.

Time Stamping Voice and Text

In at least some embodiments described above, various HU voice delay concepts have been described where an HU's voice signal broadcast is delayed in order to bring the voice signal broadcast more temporally in line with associated captioned text. Thus, for instance, in a system that requires at least three seconds (and at times more time) to transcribe an HU's voice signal to text for presentation, a system processor may be programmed to introduce a three second delay in the HU voice broadcast to an AU to bring the HU voice signal broadcast more into simultaneous alignment with associated text generated by the system. As another instance, in a system where an ASR requires at least two seconds to transcribe an HU's voice signal to text for presentation to a correcting CA, the system processor may be programmed to introduce a two second delay in the HU voice that is broadcast to an AU to bring the HU voice signal broadcast more into temporal alignment with the ASR generated text.

In the above examples, the three and two second delays are simply based on the average minimum voice-to-text delays that occur with a specific voice to text system and therefore, at most times, will only imprecisely align an HU voice signal with corresponding text. For instance, in a case where HU voice broadcast is delayed three seconds, if text transcription is delayed ten seconds, the three second delay would be insufficient to align the broadcast voice signal and text presentation. As another instance, where the HU voice is delayed three seconds, if a text transcription is generated in one second, the three second delay would cause the HU voice to be broadcast two seconds after presentation of the associated text. In other words, in this example, the three second HU voice delay would be too much delay at times and too little at other times and the misalignment could cause AU confusion.

In at least some embodiments it is contemplated that a transcription system may assign time stamps to various utterances in an HU's voice signal and those time stamps may also be assigned to text that is then generated from the utterances so that the HU voice and text can be precisely synchronized per user preferences (e.g., precisely aligned in time or, if preferred by an AU, with an HU's voice preceding or delayed with respect to text by the same persistent period) when broadcast and presented to the AU, respectively. While alignment per an AU's preferences may cause an HU voice to be broadcast prior to or after presentation of associated text, hereinafter, unless indicated otherwise, it will be assumed that an AU's preference is that the HU voice and related text be broadcast and presented at substantially the same time (e.g., within 1-2 seconds before or after). It should be recognized that in any embodiment described hereafter where the description refers to aligned or simultaneous voice and text, the same teachings will be applicable to cases where voice and text are purposefully misaligned by a persistent period (e.g., always misaligned by 3 seconds per user preference).

Various systems are contemplated for assigning time stamps to HU voice signals and associated text words and/or phrases. In a first relatively simple case, an AU device that receives an HU voice signal may assign periodic time stamps to sequentially received voice signal segments and store the HU voice signal segments along with the associated time stamps. The AU device may also transmit at least an initial time stamp (e.g., corresponding to the beginning of the HU voice signal or the beginning of a first HU voice signal segment during a call) along with the HU voice signal to a relay when captioning is to commence.

In at least some embodiments the relay stores the initial time stamp in association with the beginning instant of the received HU voice signal and continues to store the HU voice signal as it is received. In addition, the relay operates its own timer to generate time stamps for on-going segments of the HU voice signal as the voice signal is received and the relay generated time stamps are stored along with associated HU voice signal segments (e.g., one time stamp for each segment that corresponds to the beginning of the segment). In a case where a relay operates an ASR engine or taps into a fourth party ASR service (e.g., Google Voice, IBM's Watson, etc.) where a CA checks and corrects ASR generated text, the ASR engine generates automated text for HU voice segments in real time as the HU voice signal is received.

A CA computer at the relay simultaneously broadcasts the HU voice segments and presents the ASR generated text to a CA at the relay for correction. Here, the ASR engine speed will fluctuate somewhat based on several factors that are known in the speech recognition art so that it can be assumed that the ASR engine will translate a typical HU voice signal segment to text within anywhere from a fraction of a second (e.g., one tenth of a second) to 10 seconds. Thus, where the CA computer is configured to simultaneously broadcast HU voice and present ASR generated text for CA consideration, in at least some embodiments the relay is programmed to delay the HU voice signal broadcast dynamically for a period within the range of a fraction of a second up to the maximum number of seconds required for the ASR engine to transcribe a voice segment to text. Again, here, a CA may have control over the timing between text presentation and HU voice broadcast and may prefer one or the other of the text and voice to precede the other (e.g., HU voice to precede corresponding text by two seconds or vice versa). In these cases, the preferred delay between voice and text can be persistent and unchanging which results in less CA confusion. Thus, for instance, regardless of delay between an HU's initial utterance and ASR text generation, both the utterance and the associated ASR text can be persistently presented simultaneously in at least some embodiments.

After a CA corrects text errors in the ASR engine generated text, in at least some cases the relay transmits the time stamped text back to the AU caption device for display to the AU. Upon receiving the time stamped text from the relay, the AU device accesses the time stamped HU voice signal stored thereat and associates the text and HU voice signal segments based on similar (e.g., closest in time) or identical time stamps and stores the associated text and HU voice signal until presented and broadcast to the AU. The AU device then simultaneously (or delayed per user preference) broadcasts the HU voice signal segments and presents the corresponding text to the AU via the AU caption device in at least some embodiments.
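
A minimal sketch of the pairing step described above follows, assuming the AU device keeps its stored voice segments and the received text segments as simple (time stamp, payload) lists; the data layout and function name are hypothetical rather than part of the disclosed apparatus.

```python
# Sketch: pair each time stamped text segment returned from the relay with the
# stored HU voice segment whose time stamp is closest, so the two can later be
# broadcast and displayed together.

from bisect import bisect_left

def pair_text_with_voice(voice_segments, text_segments):
    """voice_segments: list of (timestamp_seconds, audio_bytes), sorted by time.
    text_segments:  list of (timestamp_seconds, text).
    Returns a list of (audio_bytes, text) pairs ready for synchronized output."""
    stamps = [t for t, _ in voice_segments]
    pairs = []
    for t_text, text in text_segments:
        i = bisect_left(stamps, t_text)
        # Pick whichever neighboring voice stamp is closest in time.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(stamps)]
        best = min(candidates, key=lambda j: abs(stamps[j] - t_text))
        pairs.append((voice_segments[best][1], text))
    return pairs

voice = [(0.0, b"seg0"), (2.0, b"seg1"), (4.0, b"seg2")]
text = [(0.1, "Hello there"), (4.2, "see you at nine")]
print(pair_text_with_voice(voice, text))
```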

A flow chart that is consistent with this simple first case of time stamping text segments is shown in FIG. 32 and will be described next. Referring also to FIG. 33, a system similar to the system described above with respect to FIG. 1 is illustrated where similar elements are labelled with the same numbers used in FIG. 1 and, unless indicated otherwise, operate in a similar fashion. The primary differences between the FIG. 1 system and the system described in FIG. 33 are that each of the AU caption device 12 and the relay 16 includes a memory device that stores, among other things, time stamped voice message segments corresponding to a received HU voice signal and that time stamps are transmitted between AU device 12 and relay server 30 (see 1034 and 1036).

Referring to FIGS. 32 and 33, during a call between an HU using an HU device 14 and an AU using AU device 12, at some point captioning is required by the AU (e.g., either immediately when the call commences or upon selection of a caption option by the AU) at which point AU device 12 performs several functions. First, after captioning is to commence, at block 1102, the HU voice signal is received by the AU device 12. At block 1104, AU device 12 commences assignment and continues to assign periodic time stamps to the HU voice signal segments received at the AU device. The time stamps include an initial time stamp t0 corresponding to the instant in time when captioning is to commence or some specific instant in time thereafter as well as following time stamps. In addition, at block 1104, AU device 12 commences storing the received HU voice signal along with the assigned time stamps that divide up the HU voice signal into segments in AU device memory 1030.

Referring still to FIGS. 32 and 33, at block 1106, AU device 12 transmits the HU voice signal segments to relay 16 along with the initial time stamp t0 corresponding to the instant captioning was initiated where the initial time stamp is associated with the start of the first HU voice segment transmitted to the relay (see 1034 in FIG. 33). At block 1108, relay 16 stores the initial time stamp t0 along with the first HU voice signal segment in memory 1032, runs its own timer to assign subsequent time stamps to the HU voice signal received and stores the HU voice signal segments and relay generated time stamps in memory 1032. Here, because both the AU device and the relay assign the initial time stamp t0 to the same point within the HU voice signal and each assigns other stamps based on the initial time stamp, all of the AU device and relay time stamps should be aligned assuming that each assigns time stamps at the same periodic intervals (e.g., every second).

In other cases, each of the AU device and the relay may assign second and subsequent time stamps having the form (t0+Δt) where Δt is a period of time relative to the initial time stamp t0. Thus, for instance, a second time stamp may be (t0+1 sec), a third time stamp may be (t0+4 sec), etc. In this case, the AU device and relay may assign time stamps that have different periods where the system simply aligns stamped text and voice when required based on the closest stamps in time.

Continuing, at block 1110, relay 16 runs an ASR engine to generate ASR engine text for each of the stored HU voice signal segments and stores the ASR engine text with the corresponding time stamped HU voice signal segments. At block 1112, relay 16 presents the ASR engine text to a CA for consideration and correction. Here, the ASR engine text is presented via a CA computer display screen 32 while the HU voice segments are simultaneously (e.g., as text is scrolled onto display 32) broadcast to the CA via headset 54. The CA uses display 32 and/or other interface devices to make corrections (see block 1116) to the ASR engine text. Corrections to the text are stored in memory 1032 and the resulting text is transmitted at block 1118 to AU device 12 along with a separate time stamp for each of the text segments (see 1036 in FIG. 33).

Referring yet again to FIGS. 32 and 33, upon receiving the time stamped text, AU device 12 correlates the time stamped text with the HU voice signal segments and associated time stamps in memory 1030 and stores the text with the associated voice segments and related time stamps at block 1120. At block 1122, in some embodiments, AU device 12 simultaneously broadcasts and presents the correlated HU voice signal segments and text segments to the AU via an AU device speaker and the AU device display screen, respectively.

Referring still to FIG. 32, it should be appreciated that the time stamps applied to HU voice signal segments and corresponding text segments enable the system to align voice and text when presented to each of a CA and an AU. In other embodiments it is contemplated that the system may only use time stamps to align voice and text for one or the other of a CA and an AU. Thus, for instance, in FIG. 32, the simultaneous broadcast step at 1112 may be replaced by voice broadcast and text presentation immediately when available and synchronous presentation and broadcast may only be available to the AU at step 1122. In a different system, synchronous voice and text may be provided to the CA at step 1112 while HU voice signal and caption text are independently presented to the AU immediately upon reception at steps 1102 and 1122, respectively.

In the FIG. 32 process, the AU device only transmits an initial HU voice signal time stamp to the relay corresponding to the instant when captioning commences. In other cases it is contemplated that AU device 12 may transmit more than one time stamp corresponding to specific points in time to relay 16 that can be used to correct any voice and text segment misalignment that may occur during system processes. Thus, for instance, instead of sending just the initial time stamp, AU device 12 may transmit time stamps along with specific HU voice segments every 5 seconds or every 10 seconds or every 30 seconds, etc., while a call persists, and the relay may simply store each newly received time stamp along with the corresponding instant in the stream of received HU voice signal.

In still other cases AU device 12 may transmit enough AU device generated time stamps to relay 16 that the relay does not have to run its own timer to independently generate time stamps for voice and text segments. Here, AU device 12 would still store the time stamped HU voice signal segments as they are received and stamped and would correlate time stamped text received back from the relay 16 in the same fashion so that HU voice segments and associated text can be simultaneously presented to the AU.

A sub-process 1138 that may be substituted for a portion of the process described above with respect to FIG. 32 is shown in FIG. 34, albeit where all AU device time stamps are transmitted to and used by a relay so that the relay does not have to independently generate time stamps for HU voice and text segments. In the modified process, referring also and again to FIG. 32, after AU device 12 assigns periodic time stamps to HU voice signal segments at block 1104, control passes to block 1140 in FIG. 34 where AU device 12 transmits the time stamped HU voice signal segments to relay 16. At block 1142, relay 16 stores the time stamped HU voice signal segments after which control passes back to block 1110 in FIG. 32 where the relay employs an ASR engine to convert the HU voice signal segments to text segments that are stored with the corresponding voice segments and time stamps. The process described above with respect to FIG. 32 continues as described above so that the CA and/or the AU are presented with simultaneous HU voice and text segments.

In other cases it is contemplated that an AU device 12 may not assign any time stamps to the HU voice signal and, instead, the relay or a fourth party ASR service provider may assign all time stamps to voice and text signals to generate the correlated voice and text segments. In this case, after text segments have been generated for each HU voice segment, the relay may transmit both the HU voice signal and the corresponding text back to AU device 12 for presentation.

A process 1146 that is similar to the FIG. 32 process described above is shown in FIG. 35, albeit where the relay generates and assigns all time stamps to the HU voice signals and transmits the correlated time stamps, voice signals and text to the AU device for simultaneous presentation. In the modified process 1146, process steps 1150 through 1154 in FIG. 35 replace process steps 1102 through 1108 in FIG. 32 and process steps 1158 through 1162 in FIG. 35 replace process steps 1118 through 1122 in FIG. 32 while similarly numbered steps 1110 through 1116 are substantially identical between the two processes.

Process 1146 starts at block 1150 in FIG. 35 where AU device 12 receives an HU voice signal from an HU device where the HU voice signal is to be captioned. Without assigning any time stamps to the HU voice signal, AU device 12 links to a relay 16 and transmits the HU voice signal to relay 16 at block 1152. At block 1154, relay 16 uses a timer or clock to generate time stamps for HU voice signal segments after which control passes to block 1110 where relay 16 uses an ASR engine to convert the HU voice signal to text which is stored along with the corresponding HU voice signal segments and related time stamps. At block 1112, relay 16 simultaneously presents ASR text and broadcasts HU voice segments to a CA for correction and the CA views the text and makes corrections at block 1116. After block 1116, relay 16 transmits the time stamped text and HU voice segments to AU device 12 and that information is stored by the AU device as indicated at block 1160. At block 1162, AU device 12 simultaneously broadcasts and presents corresponding HU voice and text segments via the AU device speaker and display.

In cases where HU voice signal broadcast is delayed so that the broadcast is aligned with presentation of corresponding transcribed text, delay insertion points will be important in at least some cases or at some times. For instance, an HU may speak for 20 consecutive seconds where the system assigns a time stamp every 2 seconds. In this case, one solution for aligning voice with text would be to wait until the entire 20 second spoken message is transcribed and then broadcast the entire 20 second voice message and present the transcribed text simultaneously. This, however, is a poor solution as it would slow down HU-AU communication appreciably.

Another solution would be to divide up the 20 second voice message into 5 second periods with silent delays therebetween so that the transcription process can routinely catch up. For instance, here, during a first five second period plus a short transcription catch up period (e.g., 2 seconds), the first five seconds of the 20 second HU voice message is transcribed. At the end of the first 7 seconds of HU voice signal, the first five seconds of HU voice signal is broadcast and the corresponding text presented to the AU while the next 5 seconds of HU voice signal is transcribed. Transcription of the second 5 seconds of HU voice signal may take another 7 seconds which would mean that a 2 second delay or silent period would be inserted after the first five seconds of HU voice signal is broadcast to the AU. In other cases the ASR text and HU voice would be sent to the AU as soon as they are generated or received. In that case the 7 seconds described above would be the time required to complete a segment as opposed to the time required to get the first words to the AU for broadcast.

This process of inserting periodic delays into HU voice broadcast and text presentation while transcription catches up continues. Here, while it is possible that the delays at the five second times would be at ideal times between consecutive natural phrases, more often than not the 5 second point delays would imperfectly divide natural language phrases, making it more, not less, difficult to understand the overall HU voice message.

A better solution is to insert delays between natural language phrases when possible. For instance, in the case of the 20 second HU voice signal example above, a first delay may be inserted after a first 3 second natural language phrase, a second delay may be inserted after a second 4 second natural language phrase, a third delay may be inserted after a third 5 second natural language phrase, a fourth delay may be inserted after a fourth 2 second natural language phrase and a fifth delay may be inserted after a fifth 2 second natural language phrase, so that none of the natural language phrases during the voice message are broken up by intervening delays.

Software for identifying natural language phrases or natural breaks in an HU's voice signal may use actual delays between consecutive spoken phrases as one proxy for where to insert a transcription catch up delay. In some cases software may be able to perform word, sentence and/or topic segmentation in order to identify natural language phrases. Other software techniques for dividing voice signals into natural language phrases are contemplated and should be used as appropriate.
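
The following sketch illustrates the pause-based proxy just described, assuming word-level timing is available from the recognizer; the input format and threshold value are hypothetical illustrations rather than requirements of the system.

```python
# Sketch: split a stream of (start_time, end_time, word) tuples into natural
# language phrases wherever the silent gap between consecutive words exceeds a
# threshold, so catch-up delays can be inserted between phrases, not inside them.

def split_into_phrases(words, pause_threshold=0.35):
    phrases, current = [], []
    for i, (start, end, word) in enumerate(words):
        if current and start - words[i - 1][1] > pause_threshold:
            phrases.append(current)
            current = []
        current.append(word)
    if current:
        phrases.append(current)
    return phrases

words = [(0.0, 0.3, "we"), (0.32, 0.6, "are"), (0.62, 1.1, "meeting"),
         (1.8, 2.1, "at"), (2.12, 2.7, "nine")]
print(split_into_phrases(words))   # [['we', 'are', 'meeting'], ['at', 'nine']]
```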

Thus, while some systems may assign perfectly periodic time stamps to HU voice signals to divide the signals into segments, in other cases time stamps will be assigned at irregular time intervals that make more sense given the phrases that an HU speaks, how an HU speaks, etc.

Voice Message Replay

Where time stamps are assigned to HU voice and text segments, voice segments can be more accurately selected for replay via selection of associated text. For instance, see FIG. 36 that shows a CA display screen 50 with transcribed text represented at 1200. Here, as text is generated by a relay ASR engine and presented to a CA, consistent with at least some of the systems described above, the CA may select a word or phrase in the presented text via touch (represented by hand icon 1202) to replay the HU voice signal associated therewith.

When a word is selected in the presented text, several things will happen in at least some contemplated embodiments. First, a current voice broadcast to the CA is halted. Second, the selected word is highlighted (see 1204) or otherwise visually distinguished. Third, when the word is highlighted, the CA computer accesses the HU voice segment associated with the highlighted word and re-broadcasts the voice segment for the CA to re-listen to the selected word. Where time stamps are assigned with short intervening periods, the time stamps should enable relatively precise replay of selected words from the text. In at least some cases, the highlight will remain and the CA may change the highlighted word or phrase via standard text editing tools. For instance, the CA may type replacement text to replace the highlighted word with corrected text. As another instance, the CA may re-voice the broadcast word or phrase so that software trained to the CA's voice can generate replacement text. Here, the software may use the newly uttered word as well as the words that surround the uttered word in a contextual fashion to identify the replacement word.
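
A minimal sketch of the time stamp lookup that supports this replay feature follows; the segment layout is hypothetical and assumes each stored voice segment carries a start and end time stamp.

```python
# Sketch: when a CA touches a word, halt the live broadcast, find the stored HU
# voice segment whose time stamp range covers the word's time stamp, and hand
# that segment to the audio player for re-broadcast.

def segment_for_word(word_stamp, voice_segments):
    """voice_segments: list of (start_seconds, end_seconds, audio_bytes)."""
    for start, end, audio in voice_segments:
        if start <= word_stamp <= end:
            return audio
    return None

segments = [(0.0, 2.0, b"we are meeting"), (2.0, 4.5, b"at Joe's restaurant")]
print(segment_for_word(3.1, segments))   # b"at Joe's restaurant"
```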

In some cases a "Resume" or other icon 1210 may be presented proximate the selected word that can be selected via touch to continue the HU voice broadcast and text presentation at the location where the system left off when the CA selected the word for re-broadcast. In other cases, a short time (e.g., ¼ second to 3 seconds) after rebroadcasting a selected word or phrase, the system may automatically revert back to the voice and text broadcast at the location where the system left off when the CA selected the word for re-broadcast.

While not shown, in some cases when a text word is selected, the system will also identify other possible words that may correspond to the voice segment associated with the selected word (e.g., second and third best options for transcription of the HU voice segment associated with the selected word) and those options may be automatically presented for touch selection and replacement via a list of touch selectable icons, one for each option, similar to Resume icon 1210. Here, the options may be presented in a list where the first list entry is the most likely substitute text option, the second entry is the second most likely substitute text option, and so on.

Referring again to FIG. 36, in other cases when a text word is selected on a CA display screen 50, a relay server or the CA's computer may select an HU voice segment that includes the selected word and also other words in an HU voice segment or phrase that includes the selected word for re-broadcast to the CA so that the CA has some audible context in which to consider the selected word. Here, when the phrase length segment is re-broadcast, the full text phrase associated therewith may be highlighted as shown at 1206 in FIG. 36. In some cases, the selected word may be highlighted or otherwise visually distinguished in one way and the phrase length segment that includes the selected word may be highlighted or otherwise visually distinguished in a second way that is discernably different to the CA so that the CA is not confused as to what was selected (e.g., see different highlighting at 1204 and 1206 in FIG. 36).

In some cases a single touch on a word may cause the CA computer to re-broadcast the single selected word while highlighting the selected word and the associated longer phrase that includes the selected word differently, while a double tap on a word may cause the phrase that includes the selected word to be re-broadcast to provide audio context. Where the system divides up an HU voice signal by natural phrases, broadcasting a full phrase that includes a selected word should be particularly useful as the natural language phrase should be associated with a more meaningful context than an arbitrary group of words surrounding the selected word.

Even if the system rebroadcasts a full phrase including a selected word, in at least some cases CA edits will be made only to the selected word as opposed to the full phrase. Thus, for instance, in FIG. 36 where a single word is selected but a phrase including the word is rebroadcast, any CA edit (e.g., text entry or text generated by software in response to a revoiced word or phrase) would only replace the selected word, not the entire phrase.

Upon selection of Resume icon 1210, the highlighting is removed from the selected word and the CA computer restarts simultaneously broadcasting the HU voice signal and presenting associated transcribed text at the point where the computer left off when the re-broadcast word was selected. In some cases, the CA computer may back up a few seconds from the point where the computer left off to restart the broadcast to re-contextualize the voice and text presented to the CA as the CA again begins correcting text errors.

In other cases, instead of requiring a user to select a "Resume" option, the system may, after a short period (e.g., one second after the selected word or associated phrase is re-broadcast), simply revert back to broadcasting the HU voice signal and presenting associated transcribed text at the point where the computer left off when the re-broadcast word was selected. Here, a beep or other audibly distinguishable signal may be generated upon word selection and at the end of a re-broadcast to audibly distinguish the re-broadcast from broadcast HU voice. In other cases any re-broadcast voice signal may be audibly modified in some fashion (e.g., higher pitch or tone, greater volume, etc.) to audibly distinguish the re-broadcast from other HU voice signal broadcast.

To enable a CA to select a phrase that includes more than one word for rebroadcast or for correction, in at least some cases it is contemplated that when a user touches a word presented on the CA display device, that word will immediately be fully highlighted. Then, while still touching the initially selected and highlighted word, the CA can slide her finger left or right to select adjacent words until a complete phrase to be selected is highlighted. Upon removing her finger from the display screen, the highlighted phrase remains highlighted and revoicing or text entry can be used to replace the entire highlighted phrase.

Referring now to FIG. 37, a screen shot akin to the screen shot shown in FIG. 26 is illustrated at 50 that may be presented to an AU via an AU device display, albeit where an AU has selected a word from within transcribed text for re-broadcast. In at least some embodiments, similar to the CA system described above, when an AU selects a word from presented text, the instantaneous HU voice broadcast and text presentation is halted, the selected word is highlighted or otherwise visually distinguished as shown at 1230 and the phrase including the selected word may also be differently visually distinguished as shown at 1231. Beeps or other audible signals may be generated immediately prior to and after re-broadcast of a voice signal segment. When a word is selected, the AU device speaker (e.g., the speaker in associated handset 22) re-broadcasts the HU voice signal that is associated through the assigned time stamp to the selected word. In other cases the AU device will re-broadcast the entire phrase or sub-phrase that includes the selected word to give audio context to the selected word.

Referring again to FIG. 37, when an AU selects a word for rebroadcasting, in at least some cases if that word is still on a CA's display screen when the AU selects the word, that word may be specially highlighted on the CA display to alert or indicate to the CA that the AU had trouble understanding the selected word. To this end, see in FIG. 36 that the word selected in FIG. 37 is highlighted on the exemplary CA display screen at 1201. Here, the CA may read the phrase including the word and either determine that the text is accurate or that a transcription error occurred. Where the text is wrong, the CA may correct the text or may simply ignore the error and continue on with transcription of the continuing HU voice signal.

While the time stamping concept is described above with respect to a system where an ASR initially transcribes an HU voice signal to text and a CA corrects the ASR generated text, the time stamping concept is also advantageously applicable to cases where a CA transcribes an HU voice signal to text and then corrects the transcribed text or where a second CA corrects text transcribed by a first CA. To this end, in at least some cases it is contemplated that an ASR may operate in the background of a CA transcription system to generate and time stamp ASR text (e.g., text generated by an ASR engine) in parallel with the CA generated text. A processor may be programmed to compare the ASR text and CA generated text to identify at least some matching words or phrases and to assign the time stamps associated with the matching ASR generated words or phrases to the matching CA generated text.

It is recognized that the CA text will likely be more accurate than the ASR text most of the time and therefore that there will be differences between the two text strings. However, some if not most of the time the ASR and CA generated texts will match so that many of the time stamps associated with the ASR text can be directly applied to the CA generated text to align the HU voice signal segments with the CA generated text. In some cases it is contemplated that confidence factors may be generated for likely associated ASR and CA generated text and time stamps may only be assigned to CA generated text when a confidence factor is greater than some threshold confidence factor value (e.g., 88/100). In most cases it is expected that confidence factors that exceed the threshold value will occur routinely and with short intervening durations so that a suitable number of reliable time stamps can be generated.

Once time stamps are associated with CA generated text, the stamps may be used to precisely align HU voice signal broadcast and text presentation to an AU or a CA (e.g., in the case of a second "correcting CA") as described above as well as to support re-broadcast of HU voice signal segments corresponding to text selected by a CA and/or an AU.

A sub-process 1300 that may be substituted for a portion of the FIG. 32 process is shown in FIG. 38, albeit where ASR generated time stamps are applied to CA generated text. Referring also to FIG. 32, steps 1302 through 1310 shown in FIG. 38 are swapped into the FIG. 32 process for steps 1112 through 1118. Referring also to FIG. 32, after an ASR engine generates and stores time stamped text segments for a received HU voice signal segment, control passes to block 1302 in FIG. 38 where the relay broadcasts the HU voice signal to a CA and the CA revoices the HU voice signal to transcription software trained to the CA's voice and the software yields CA generated text.

At block 1304, a relay server or processor compares the ASR text to the CA generated text to identify high confidence "matching" words and/or phrases. Here, the phrase high confidence means that there is a high likelihood (e.g., 95% likely) that an ASR text word or phrase and a CA generated text word or phrase both correspond to the exact same HU voice signal segment. Characteristics analyzed by the comparing processor include identical or nearly identical multiple word strings in the compared text, when text appears temporally in each text string relative to other assigned time stamps, easily transcribed words where both an ASR and a CA are highly likely to transcribe the words accurately, etc. In some cases time stamps associated with the ASR text are only assigned to the CA generated text when the confidence factor related to the comparison is above some threshold level (e.g., 88/100). Time stamps are assigned at block 1306 in FIG. 38.
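
The following sketch illustrates one way the comparison at block 1304 might be approximated, using runs of identical words as a crude stand-in for the confidence factor described above; the run-length heuristic and the data layout are assumptions for illustration, not the disclosed method.

```python
# Sketch: find matching words between the time stamped ASR text and the CA
# generated text and copy the ASR time stamps onto the CA words only where the
# match looks high confidence (here, part of a run of identical words).

from difflib import SequenceMatcher

def transfer_time_stamps(asr_words, ca_words, min_run=3):
    """asr_words: list of (timestamp, word); ca_words: list of words.
    Returns {ca_index: timestamp} for high confidence matches."""
    matcher = SequenceMatcher(a=[w for _, w in asr_words], b=ca_words)
    stamps = {}
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:      # long identical runs = high confidence
            for offset in range(block.size):
                stamps[block.b + offset] = asr_words[block.a + offset][0]
    return stamps

asr = [(0.0, "we"), (0.4, "are"), (0.8, "meeting"), (1.3, "at"),
       (1.6, "joes"), (2.0, "rodent")]
ca = ["we", "are", "meeting", "at", "joes", "restaurant"]
print(transfer_time_stamps(asr, ca))   # stamps for the first five matching words
```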

At block 1308, the relay presents the CA generated text to the CA for correction and at block 1310 the relay transmits the time stamped CA generated text segments to the AU device. After block 1310, control passes back to block 1120 in FIG. 32 where the AU device correlates time stamped CA generated text with HU voice signal segments previously stored in the AU device memory and stores the time stamps, text and associated voice segments. At block 1122, the AU device simultaneously broadcasts and presents identically time stamped HU voice and CA generated text to an AU. Again, in some cases, the AU device may have already broadcast the HU voice signal to the AU prior to block 1122. In this case, upon receiving the text, the text may be immediately presented via the AU device display to the AU for consideration. Here, the time stamped HU voice signal and associated text would only be used by the AU device to support synchronized HU voice and text re-play or re-presentation.

In some cases the time stamps assigned to a series of text and voice segments may simply represent relative time stamps as opposed to actual time stamps. For instance, instead of labelling three consecutive HU voice segments with actual times 3:55:45 AM; 3:55:48 AM; 3:55:51 AM; . . . , the three segments may be labelled t0, t1, t2, etc., where the labels are repeated after they reach some maximum number (e.g., t20). In this case, for instance, during a 20 second HU voice signal, the 20 second signal may have five consecutive labels t0, t1, t2, t3 and t4 assigned, one every four seconds, to divide the signal into five consecutive segments. The relative time labels can be assigned to HU voice signal segments and also associated with specific transcribed text segments.

In at least some cases it is contemplated that the rate of time stamp assignment to an HU voice signal may be dynamic. For instance, if an HU is routinely silent for long periods between intermittent statements, time stamps may only be assigned during periods while the HU is speaking. As another instance, if an HU speaks slowly at times and more rapidly at other times, the number of time stamps assigned to the user's voice signal may increase (e.g., when speech is rapid) and decrease (e.g., when speech is relatively slow) with the rate of user speech. Other factors may affect the rate of time stamps applied to an HU voice signal.

While the systems described above are described as ones where time stamps are assigned to an HU voice signal by either or both of an AU's device and a relay, in other cases it is contemplated that other system devices or processors may assign time stamps to the HU voice signal including a fourth party ASR engine provider (e.g., IBM's Watson, Google Voice, etc.). In still other cases where the HU device is a computer (e.g., a smart phone, a tablet type computing device, a laptop computer), the HU device may assign time stamps to the HU voice signal and transmit them to other system devices that need time stamps. All combinations of system devices assigning new or redundant time stamps to HU voice signals are contemplated.

In any case where time stamps are assigned to voice signals and text segments, words, phrases, etc., the engine(s) assigning the time stamps may generate stamps indicating any of (1) when a word or phrase is voiced in an HU voice signal audio stream (e.g., 16:22.0 to 16:22.5 corresponds to the word "Now") and (2) the time at which text is generated by the ASR for a specific word (e.g., "Now" generated at 16:25). Where a CA generates text or corrects text, a processor related to the relay may also generate time stamps indicating when a CA generated word is generated as well as when a correction is generated.

In at least some embodiments it is contemplated that any time a CA falls behind when transcribing an HU voice signal or when correcting an ASR engine generated text stream, the speed of the HU voice signal broadcast may be automatically increased or sped up as one way to help the CA catch up to a current point in an HU-AU call. For instance, in a simple case, any time a CA caption delay (e.g., the delay between an HU voice utterance and CA generation of text or correction of text associated with the utterance) exceeds some threshold (e.g., 12 seconds), the CA interface may automatically double the rate of HU signal broadcast to the CA until the CA catches up with the call.

In at least some cases the rate of broadcast may be dynamic between a nominal value representing the natural speaking speed of the HU and a maximum rate (e.g., three times the natural HU voice speed), and the instantaneous rate may be a function of the degree of captioning delay. Thus, for instance, where the captioning delay is 4 or fewer seconds, the broadcast rate may be "1" representing the natural speaking speed of the HU, if the delay is between 4 and 8 seconds the broadcast rate may be "2" (e.g., twice the natural speaking speed), and if the delay is greater than 8 seconds, the broadcast rate may be "3" (e.g., three times the natural speaking speed).
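
A minimal sketch of this delay-driven rate schedule follows; the 4 and 8 second breakpoints are the example values from the preceding paragraph rather than required constants.

```python
# Sketch: map the current captioning delay to an HU voice broadcast rate.

def hu_broadcast_rate(caption_delay_seconds):
    if caption_delay_seconds <= 4:
        return 1.0      # natural HU speaking speed
    if caption_delay_seconds <= 8:
        return 2.0      # twice natural speed to help the CA catch up
    return 3.0          # maximum catch-up speed

for delay in (2, 6, 12):
    print(delay, hu_broadcast_rate(delay))
```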

In other cases the dynamic rate may be a function of other factors such as but not limited to the rate at which an HU utters words, perceived clarity in the connection between the HU and AU devices or between the AU device and the relay or between any two components within the system, the number of corrections required by a CA during some sub-call period (e.g., the most recent 30 seconds), statistics related to how accurately a CA can generate text or make text corrections at different speaking rates, some type of set AU preference, some type of HU preference, etc.

In some cases the rate of HU voice broadcast may be based on ASR confidence factors. For instance, where an ASR assigns a high confidence factor to a 15 second portion of HU voice signal and a low confidence factor to the next 10 seconds of the HU voice signal, the HU voice broadcast rate may be set to twice the rate of HU speaking speed during the first 15 second period and then be slowed down to the actual HU speaking speed during the next 10 second period or to some other percentage of the actual HU speaking speed (e.g., 75% or 125%, etc.).

In some cases the HU broadcast rate may be at least in part based on characteristics of an HU's utterances. For instance, where an HU's volume on a specific word is substantially increased or decreased, the word (or phrase including the word) may always be presented at the HU speaking speed (e.g., at the rate uttered by the HU). In other cases, where the volume of one word within a phrase is stressed, the entire phrase may be broadcast at speaking speed so that the full effect of the stressed word can be appreciated. As another instance, where an HU draws out pronunciation of a word such as "Well . . . " for 3 seconds, the word (or phrase including the word) may be presented at the spoken rate.

In some cases the HU voice broadcast rate may be at least in part based on words spoken by an HU or on content expressed in an HU's spoken words. For instance, simple words that are typically easy to understand including "Yes", "No", etc., may be broadcast at a higher rate than complex words like medical diagnoses, multi-syllable terms, etc.

In cases where the system generates text corresponding to both HU and AU voice signals, in at least some embodiments it is contemplated that during normal operation only text associated with the HU signal may be presented to an AU and that the AU text may only be presented to the AU if the AU goes back in the text record to review the text associated with a prior part of a conversation. For instance, if an AU scrolls back in a conversation 3 minutes to review prior discussion, ASR generated AU voice related text may be presented at that time along with the HU text to provide context for the AU viewing the prior conversation.

In the systems described above, whenever a CA is involved in a caption assisted call, the CA considers an entire HU voice signal and either generates a complete CA generated text transcription of that signal or corrects ASR generated text errors while considering the entire HU voice signal. In other embodiments it is contemplated that where an ASR engine generates confidence factors, the system may only present to a CA the sub-portions of an HU voice signal that are associated with relatively low confidence factors, to speed up the error correction process. Here, for instance, where ASR engine confidence factors are high (e.g., above some high factor threshold) for a 20 second portion of an HU voice signal and then are low for the next 10 seconds, a CA may only be presented the ASR generated text and the HU voice signal may not be broadcast to the CA during the first 20 seconds while substantially simultaneous HU voice and text are presented to the CA during the following 10 second period so that the CA is able to correct any errors in the low confidence text. In this example, it is contemplated that the CA would still have the opportunity to select an interface option to hear the HU voice signal corresponding to the first 20 second period or some portion of that period if desired.

When a remote third party ASR engine generates and provides captions to a relay but does not provide confidence factors to the relay, in at least some embodiments a local ASR run at a relay may generate a local caption set in parallel and may use the local caption set to assess confidence factors for captions received from the remote ASR engine. Here, a local processor may compare remote ASR captions to the locally generated ASR captions and may assign confidence factors to each remote ASR caption word or phrase based on results of the comparison. For instance, where there is a mismatch, a low confidence factor may be assigned to the remote ASR caption word or phrase. As another instance, a low confidence factor may only be assigned to the remote ASR caption word when the local ASR caption word is grammatically correct. Other algorithms for assessing low confidence are contemplated.
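
The following sketch illustrates the comparison in a simplified form, assuming the remote and local caption word lists are already aligned word for word; a practical system would need a real alignment step, so this is an illustration of the scoring idea only, with hypothetical score values.

```python
# Sketch: assign a low confidence score to any remote ASR caption word that the
# locally generated caption set transcribed differently.

def score_remote_captions(remote_words, local_words,
                          high_confidence=0.95, low_confidence=0.40):
    scores = []
    for i, word in enumerate(remote_words):
        local = local_words[i] if i < len(local_words) else None
        scores.append((word, high_confidence if word == local else low_confidence))
    return scores

remote = ["we", "are", "meeting", "at", "joes", "rodent"]
local = ["we", "are", "meeting", "at", "joes", "restaurant"]
print(score_remote_captions(remote, local))   # "rodent" gets the low score
```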

In other cases, the local processor may assess low confidence for a remote ASR word or phrase when the local ASR generates a plurality of viable options (e.g., 2, 3, 4, etc.) for the word or phrase.

In some cases only a portion of HU voice signal corresponding to low confidence ASR engine text may be presented at all times and in other cases this technique of skipping broadcast of HU voice associated with high confidence text may only be used by the system during threshold catch up periods of operation. For instance, the technique of skipping broadcast of HU voice associated with high confidence text may only kick in when a CA text correction process is delayed from an HU voice signal by 20 or more seconds (e.g., via a threshold period).

In particularly advantageous cases, low confidence text and associated voice may be presented to a CA at normal speaking speed and high confidence text and associated voice may be presented to a CA at an expedited speed (e.g., 3 times normal speaking speed) when a text presentation delay (e.g., the period between the time an HU uttered a word and the time when a text representation of the word is presented to the CA) is less than a maximum latency period, and if the delay exceeds the maximum latency period, high confidence text may be presented in block form (e.g., as opposed to rapid sequential presentation of separate words) without broadcasting the HU voice, to expedite the catchup process.

In cases where a system processor or server determines when to automatically switch or when to suggest a switch from a CA captioning system to an ASR engine captioning system, several factors may be considered including the following:

1. Percent match between ASR generated words and CA generated words over some prior captioning period (e.g., the last 30 seconds);
2. How accurately ASR confidence factors reflect corrections made by a CA;
3. Words per minute spoken by an HU and how that affects accuracy;
4. Average delay between ASR and CA generated text over some prior captioning period;
5. An expressed AU preference stored in an AU preferences database accessible by a system processor;
6. Current AU preferences as set during an ongoing call via an on screen or other interface tool;
7. Clarity of the received signal or some other proxy for line quality of the link between any two processors or servers within the system;
8. Identity of an HU conversing with an AU; and
9. Characteristics of an HU's voice signal.

Other factors are contemplated.

Handling Automatic and Ongoing ASR Text Corrections

In at least some cases a speech recognition engine will sequentially generate a sequence of captions for a single word or phrase uttered by a speaker. For instance, where an HU speaks a word, an ASR engine may generate a first "estimate" of a text representation of the word based simply on the sound of the individual word and nothing more. Shortly thereafter (e.g., within 1 to 6 seconds), the ASR engine may consider words that surround (e.g., come before and after) the uttered word along with a set of possible text representations of the word to identify a final estimate of a text representation of the uttered word based on context derived from the surrounding words. Similarly, in the case of a CA revoicing an HU voice signal to an ASR engine trained to the CA's voice to generate text, multiple iterations of text estimates may occur sequentially until a final text representation is generated.

In at least some cases it is contemplated that every best estimate of a text representation of every word to be transcribed will be transmitted immediately upon generation to an AU device for continually updated presentation to the AU so that the AU has the best HU voice signal transcription that exists at any given time. For instance, in a case where an ASR engine generates at least one intermediate text estimate and a final text representation of a word uttered by an HU and where a CA corrects the final text representation, each of the interim text estimate, the final text representation and the CA corrected text may be presented to the AU where updates to the text are made as in line corrections thereto (e.g., by replacing erroneous text with corrected text directly within the text stream presented) or, in the alternative, corrected text may be presented above or in some spatially associated location with respect to the erroneous text.
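
A minimal sketch of this continually updated presentation follows; the message format (a word position plus its current best text) is a hypothetical simplification used only for illustration.

```python
# Sketch: every best estimate for a word is pushed to the AU device as soon as
# it exists, and later estimates for the same word position simply overwrite the
# earlier ones in the displayed caption line (an in line correction).

class CaptionLine:
    def __init__(self):
        self.words = {}          # word position -> current best text

    def apply_update(self, position, text):
        """Interim ASR estimate, final ASR text, or CA correction for one word."""
        self.words[position] = text

    def render(self):
        return " ".join(self.words[p] for p in sorted(self.words))

line = CaptionLine()
line.apply_update(0, "we"); line.apply_update(1, "are"); line.apply_update(2, "melting")
line.apply_update(2, "meeting")          # later, higher confidence estimate
print(line.render())                     # "we are meeting"
```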

In cases where an ASR engine generates intermediate and final text representations while a CA is also charged with correcting text errors, if the ASR engine is left to continually make context dependent corrections to text representations, there is the possibility that the ASR engine could change CA generated text and thereby undo an intended and necessary CA correction.

To eliminate the possibility of an ASR modifying CA corrected text, in at least some cases it is contemplated that automatic ASR engine contextual corrections for at least CA corrected text may be disabled immediately after a CA correction is made or even once a CA commences correcting a specific word or phrase. In this case, for instance, when a CA initiates a text correction or completes a correction in text presented on her device display screen, the ASR engine may be programmed to assume that the CA corrected text is accurate from that point forward. In some cases, the ASR engine may be programmed to assume that a CA corrected word is a true transcription of the uttered word which can then be used as true context for ascertaining the text to be associated with other ASR engine generated text words surrounding the true or corrected word. In some cases text words prior to and following the CA corrected word may be corrected by the ASR engine based on the CA corrected word that provides new context or, in other cases, independent of that context. Hereinafter, unless indicated otherwise, when an ASR engine is disabled from modifying a word in a text phrase, the word will be said to be "firm".

In cases where CA activity renders a word or phrase firm so that further ASR corrections are not presented to a CA or an AU, the ASR may still generate error corrections for the firm words or phrases for other purposes. For instance, in at least some cases where an ASR generates a change for a word or phrase after that word or phrase has been made firm by CA action and the ASR change does not match the firm word or phrase, without changing the word or phrase, that word or phrase may still be highlighted or otherwise visually distinguished for the CA so that the CA is at least aware that the new ASR hypothesis for the word or phrase does not match the firm word or phrase. Here, the CA may simply ignore the indicated mismatch or elect to reconsider the word or phrase for error correction.

As another instance, where an ASR generates a change for a word or phrase after that word or phrase has been made firm by CA action, a processor programmed to assess ASR accuracy may compare the firm text to the ASR text change as part of the accuracy calculation. Thus, taking the firm text to be truth (e.g., accurate), the processor may be programmed to increase an ASR accuracy rating when the ASR text change matches the firm text and to decrease that rating when there is a mismatch.

In still other embodiments it is contemplated that after a CA listens to a word or phrase broadcast to the CA, or some short duration of time thereafter, the word or phrase may become firm irrespective of whether or not a CA corrects that word or phrase or another word or phrase subsequent thereto. For instance, in some cases once a specific word is broadcast to a CA for consideration, the word may be designated firm. In this case each broadcast word is made firm immediately upon broadcast of the word and therefore, after being broadcast, no word is automatically modified by an ASR engine. Here the idea is that once a CA listens to a broadcast word and views a representation of that word as generated by the ASR engine, either the word is correct or, if incorrect, the CA is likely about to correct that word and therefore an ASR correction could be confusing and should be avoided.

As another instance, in some cases where a word forms part of a larger phrase, the word and other words in the phrase may not be designated firm until after either (1) a CA corrects the word or a word in the phrase that is subsequent thereto or (2) the entire phrase has been broadcast to the CA for consideration. Here, the idea is that in many cases a CA will have to listen to an entire phrase in order to assess accuracy of specific transcribed words so firming up phrase words prior to complete broadcast of the entire phrase may be premature.

In still other cases, a processor may recognize word phrases within ASR text and firm up an entire phrase just prior to or at the instant the first word in the phrase is broadcast to a CA for consideration. Thus, for instance, where a processor identifies a phrase including 8 words, at the instant in time when the first word is broadcast to the CA for consideration, the entire 8 word phrase may be made firm so that the ASR is not modifying text in the phrase as the CA is considering how to correct the ASR captions. The idea here is that a CA may find it distracting to listen to an HU broadcast while trying to correct ASR text when the ASR text is changing in real time.

In other cases a system processor may firm up all ASR text that is within some number of words or seconds of HU voice signal of a current word being broadcast to a CA for correction. For instance, all ASR text words within 8 words of a current word being broadcast to a CA may be rendered firm so that they do not change on the CA display screen. Here, in some cases when a word is firm for the CA, the word may also be firm for the AU (e.g., the ASR will not modify the word and only CA error corrections to the word would be sent along to the AU for in line or other correction). In other cases, while a word may be firm for a CA, automatic ASR error corrections may still be sent along to the AU captioned device for in line corrections until the CA makes final error corrections, at which point the captions presented to the AU prior to a CA final correction would be made firm.
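A sketch of the windowed firming rule described above might look like the following, assuming each caption word is tracked as a simple record with a firm flag; the 8 word window mirrors the example value and is not a fixed requirement.

    def firm_near_broadcast(words, broadcast_index, window=8):
        # words: list of dicts such as {"text": "hello", "firm": False}.
        # Every word within `window` positions up to and including the word
        # currently being broadcast to the CA is marked firm so that it stops
        # changing on the CA (and optionally AU) display screen.
        start = max(0, broadcast_index - window)
        for word in words[start:broadcast_index + 1]:
            word["firm"] = True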

As yet one other instance, in some cases automatic firm designations may be assigned to each word in an HU voice signal a few seconds (e.g., 3 seconds) after the word is broadcast, a few words (e.g., 5 words) after the word is broadcast, or in some other time related fashion.

In at least some cases it is contemplated that if a CA corrects a word or words at one location in presented text and an ASR subsequently contextually corrects a word or phrase that precedes the CA corrected word or words, the subsequent ASR correction may be highlighted or otherwise visually distinguished so that the CA's attention is called thereto to consider the ASR correction. In at least some cases, when an ASR corrects text prior to a CA text correction, the text that was corrected may be presented in a hovering tag proximate the ASR correction and may be touch selectable by the CA to revert back to the pre-correction text if the CA so chooses. To this end, see the CA interface screen shot 1391 shown in FIG. 43 where ASR generated text is shown at 1393 that is similar to the text presented in FIG. 39, albeit with a few corrections. More specifically, in FIG. 43, it is assumed that a CA corrected the word "cods" to "kids" at 1395 (compare again to FIG. 39) after which an ASR engine corrected the prior word "bing" to "bring". The prior ASR corrected word is highlighted or distinguished as shown at 1397 and the word that was changed to make the correction is presented in hovering tag 1399. Tag 1399 is touch selectable by the CA to revert back to the prior word if selected.

In other cases where a CA initiates or completes a word correction, the ASR engine may be programmed to disable generating additional estimates or hypotheses for any words uttered by the HU prior to the CA corrected word or within a text segment or phrase that includes the corrected word. Thus, for instance, in some cases, where 30 text words appear on a CA's display screen, if the CA corrects the fifth most recently presented word, that word and the 25 preceding words would be rendered firm and unchangeable via the ASR engine. Here, in some cases the CA would still be free to change any word presented on her display screen at any time. In other cases, once a CA corrects a word, that word and any preceding text words may be firm as to both the CA and the ASR engine.

In at least some embodiments a CA interface may be equipped with some feature in addition to error correction that enables a CA to firm up all current text results prior to some point in a caption representation on the CA's and AU's display screens. For instance, in some cases a specific simultaneous keyboard selection like the "Esc" key and an "F1" key while a cursor is at a specific location in a caption representation may cause all text that precedes that point, whether ASR initial, ASR corrected, CA initial or CA corrected, to become firm. As another instance, in at least some cases where a CA's display screen is touch sensitive, a CA may contact the screen at a location associated with a captioned word and may perform some on screen gesture to indicate that words prior thereto should be made firm. For example, the on screen gesture may include a swipe upward, a double tap, or some other gesture reserved for firming up prior captioned text on the screen.

In still other cases, a CA may have a "Firm" or other labelled button or selectable on screen icon which, when selected by a CA, firms up all instantaneous caption text. In this regard, see the "Firm" screen icon 799 in FIG. 23 which, when selected, firms up the most recent caption text so that no other ASR corrections are suggested or made. Here, instead of a selectable button or icon, in some cases the CA workstation processor may instead be programmed to receive CA caption control voice commands including, among others, a voice control command to firm up the most current text. Here, the CA voice command would be filtered out of any CA voice signal that is captioned so that it does not affect a CA transcription.

In still other cases, an AU may be able to firm up text by selecting an on screen icon (see 221 in FIG. 17) or via an AU voice command. Again, the AU voice command would be filtered out of the AU voice signal transmitted to the HU device.

In still other cases one or more interface output signals may be used by a CA to help the CA track the CA's correction efforts. For instance, whenever a CA corrects a word or phrase in caption text, all text prior to and including the correction may be highlighted or otherwise visually distinguished (e.g., text color changed) to indicate the point of the most recent CA text change. Here, in some cases, the CA could still make changes prior to the most recent change but the color change to indicate the latest change in the text would persist. In still other cases the CA may be able to select specific keys like an "Esc" key and some other key (e.g., "F2") to change text color prior to the selected point as an indication to the CA that prior text has already been considered. In still other cases it is contemplated that on screen "checked" options may be presented on the CA screen that are selectable to indicate that text prior thereto has been considered and the color should be changed. To this end see FIG. 50 where "Checked" icons (two labelled 1544) are presented after each punctuation mark to separate consecutive sentences in ASR generated text 1540. Here, if one of the checked icons is selected, text prior thereto may be highlighted or otherwise visually distinguished to indicate prior correction consideration.

While not shown, whenever text is firmed up and/or whenever a CA has indicated that text has been considered for correction, in addition to indicating that status on the CA display screen, in at least some cases that status may be indicated in a similar fashion on an AU device display screen. For instance, an on screen indicator may hover over a point in text presented to an AU where all text prior to that location is firm. The firm indicator may move smoothly along a line of text as text is firmed up during the course of a call.

When a CA firms up specific text, in at least some cases even if the CA is listening to HU voice signal prior to the point at which the text is firmed up, the system may automatically jump the HU voice broadcast point to the firmed up point so that the CA does not hear the intervening HU voice signal. When a voice signal jumps ahead, a warning may be presented to the CA on the CA's display screen confirming the jump ahead. In other cases the CA may still have to listen to the intervening HU voice signal. In still other cases the system may play the intervening HU voice signal at double, triple or some other multiple of the original speech rate to expedite the process of working through the intervening voice signal.

It has been recognized that excessive error corrections can be distracting to an AU. Thus, for instance, if an ASR automatically corrects ASR text twice and a CA corrects the text once in rapid succession, the three rapid error corrections may distract an AU from her conversation with an HU. For this reason, in at least some cases it is contemplated that the number of automatic ASR error corrections passed on to an AU captioned device for in line correction may be limited to, for instance, a single error correction, two error corrections, etc. Here, all ASR error corrections may still be used to correct non-firm ASR text presented to a CA in some cases and, in other cases, the number of ASR error corrections used to correct text presented to a CA may be limited (e.g., one, two, etc.).

In still other cases, while initially generated ASR text may be immediately transmitted to an AU and a CA for consideration, automatic ASR error corrections may only be presented initially to a CA (e.g., the AU would not see any automatic ASR error corrections). In this case, once ASR error corrections and CA error corrections become firm, all of those firm corrections would be transmitted to the AU device for in line or other correction. The idea here is that the AU would only receive a maximum of one error correction per word in displayed text once a caption is firm (e.g., fully error corrected by the ASR and CA). Again, the advantage of a single round of text correction for an AU is less distraction during an ongoing call.
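One way to cap the number of in line corrections an AU sees per word is sketched below; the per-word counter and the default limit of one correction are assumptions used only to illustrate the idea.

    def forward_correction_to_au(word_id, correction, sent_counts, limit=1):
        # sent_counts maps word_id -> number of corrections already sent to the
        # AU device.  Corrections beyond `limit` are suppressed for the AU,
        # although the CA may still see and act on them at the relay.
        if sent_counts.get(word_id, 0) >= limit:
            return None
        sent_counts[word_id] = sent_counts.get(word_id, 0) + 1
        return correction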

In some cases, the degree to which automated ASR error corrections are used to automatically correct text presented to an AU may be dynamic. For instance, if recent ASR error correction rates are high (e.g., 50% (a threshold) of words being automatically corrected by the ASR), the system may automatically stop sending ASR error corrections to an AU and instead only use the ASR error corrections to in line correct text presented to a CA.

As another instance, where recent ASR error correction accuracy (accuracy is different than error rate) for a specific call (e.g., based on comparison of ASR error corrected text to CA corrections) is low (e.g., below some threshold level), the system may automatically stop sending ASR corrections to the AU and instead may present those corrections only to the CA for consideration. Here, if at a different point in time during the call ASR error correction accuracy exceeds the threshold level or some other threshold level, the ASR error corrections may again be transmitted to the AU device for in line correction. For example, at the beginning of a captioning session, automatic ASR caption corrections may be transmitted immediately when generated to an AU device to drive in line corrections. Over time, as a CA corrects initial ASR captions as well as automatic ASR caption corrections, a processor may compare CA corrections to ASR corrected text to assess accuracy of the automatic ASR corrections on a rolling basis. If the automatic ASR correction accuracy rate is below a threshold level (e.g., 80%) for some duration of time (e.g., 10 seconds), the processor may then stop transmitting the automatic ASR corrections to the AU device and instead may simply present those corrections to the CA at the relay for consideration and CA correction.
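The rolling accuracy gate in this example could be sketched as follows; the class name and the 80%/10 second values simply restate the example parameters above and do not represent a required implementation.

    import time
    from collections import deque

    class RollingCorrectionGate:
        def __init__(self, threshold=0.80, window_seconds=10.0):
            self.threshold = threshold
            self.window = window_seconds
            self.samples = deque()  # (timestamp, matched_ca_text: bool)

        def record(self, matched_ca_text):
            # Record whether an automatic ASR correction matched the CA's
            # final text, then drop samples older than the rolling window.
            now = time.time()
            self.samples.append((now, matched_ca_text))
            while self.samples and now - self.samples[0][0] > self.window:
                self.samples.popleft()

        def send_corrections_to_au(self):
            # Keep sending ASR corrections to the AU device only while recent
            # correction accuracy stays at or above the threshold.
            if not self.samples:
                return True
            accuracy = sum(1 for _, m in self.samples if m) / len(self.samples)
            return accuracy >= self.threshold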

In still other cases, whether or not initial ASR text is transmitted to an AU device for display may be dynamic and based on ASR accuracy, this time by comparing initial ASR text to CA corrected text on an ongoing basis. For instance, where initial ASR text accuracy is below some threshold level, the system may automatically stop transmitting that text to the AU device for display and instead may simply present that text to the CA for error correction. Here, if the ASR accuracy increases and exceeds the threshold level at a later time during a call, the system may automatically start transmitting the initial ASR text to the AU device for display with error corrections transmitted subsequently for in line correction.

In still other cases initial ASR text and ASR error corrections may both be dynamically controlled in an optimized fashion to provide optimal captioning service to an AU. For instance, initial ASR text may only be transmitted to an AU device for display when accuracy is above a first threshold and ASR error corrections may only be transmitted to the AU device when the ASR error correction accuracy exceeds a second accuracy threshold. Here, when accuracy fluctuates, system operation may adapt automatically back and forth between transmitting ASR text and error corrections to the AU device and blocking that transmission.

In other cases where an ASR or other system processor identifies confidence factors for ASR text and error corrections, the system may only automatically transmit ASR captions to the AU device that are associated with high confidence factor values and may wait for CA consideration of other ASR text and error corrections in other cases. Here, the idea is that low confidence ASR text will often be wrong and therefore presenting that text to an AU may simply prove confusing. When ASR text is high confidence, it can be used to speed up delivery of captions to an AU but when low confidence, it can be delayed until the CA error corrects and the text is firmed up.
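A minimal sketch of this confidence based routing, assuming a per-segment confidence value and an illustrative cutoff, might look like the following.

    def route_asr_segment(segment_text, confidence, high_cf=0.90):
        # High confidence captions go straight to the AU device to speed up
        # delivery; low confidence captions are held for CA review and only
        # sent once firmed up.  The 0.90 cutoff is an assumed example value.
        if confidence >= high_cf:
            return "send_to_au", segment_text
        return "hold_for_ca", segment_text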

Just as automatic ASR corrections may or may not be presented to an AU based on correction accuracy, the automatic ASR corrections may not be presented to a CA if accuracy drops below some threshold level. Here, in effect, if the ASR correction accuracy is too low, it may simply be faster for a CA to correct initial ASR captions without the distraction of automated error corrections from the ASR. In still other cases it is contemplated that automatic ASR corrections may be transmitted to an AU device all the time for immediate in line correction of non-firm text and the automatic ASR corrections may only be turned on and off for the CA in a dynamic fashion based on a rolling accuracy calculation.

In still other cases, whether or not ASR error corrections are transmitted to an AU device to drive caption corrections may be based at least in part or entirely on other factors. For instance, where the HU and AU conversation rate is rapid (e.g., a high words per minute count that exceeds some threshold level), the system may be programmed to transmit all error corrections to an AU device and, where the conversation rate is below the threshold level, the system may be programmed to forego transmitting automatic ASR error corrections to the AU device or to only transmit first error corrections for any text to the AU device.

In at least some cases an AU device may support automatic triggers that cause CA activity to skip forward to a current time. For instance, in an ASR-CA backed up mode, in at least some cases where an AU has at least some hearing capability, it may be assumed that when an AU speaks, the AU is responding to a most recent HU voice signal broadcast and therefore understood the most recent HU voice signal and therefore that the AU's understanding of the conversation is current. Here, assuming the AU has a current understanding, the system may automatically skip CA error correction activities to the current HU voice signal and associated ASR text so that any error correction delay is eliminated.

In a similar fashion, in a CA caption mode, if an AU speaks, based on the assumption that the AU has a current understanding of the conversation when she speaks, the system may automatically skip CA text generation and error correction activities to a current HU voice signal so that any text generation and error correction delay is eliminated. In this case, because there is no ASR text prior to the delay skipping, in parallel with the skipping activity, an ASR may generate fill in text automatically for the HU voice signal not already captioned by the CA. Any skipping ahead based on AU speech may also firm up all text presented to the AU prior to that point as well as any fill in text where appropriate.

In cases where an AU's voice signal operates as a catch up trigger, in at least some cases the trigger may require absence of typical words or phrases that are associated with a confused state. For instance, an exemplary phrase that indicates confusion may be "What did you say?" As another instance, an exemplary phrase may be "Can you repeat?" In this case, several predefined words or phrases may be supported by the system and, any time one of those words or phrases is uttered by an AU, the system may forego skipping the delayed period so that CA error correction or CA captioning with error correction continues unabated.
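A simple sketch of such a trigger check is shown below; the phrase list mixes the two example phrases above with an additional assumed phrase, and in practice the list would be configurable.

    CONFUSION_PHRASES = (
        "what did you say",
        "can you repeat",
        "i didn't catch that",  # phrases beyond the two examples are assumptions
    )

    def au_speech_is_catch_up_trigger(au_utterance):
        # AU speech only skips the CA ahead when it does not contain a phrase
        # associated with confusion; otherwise CA error correction (or CA
        # captioning with error correction) continues unabated.
        lowered = au_utterance.lower()
        return not any(phrase in lowered for phrase in CONFUSION_PHRASES)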

In other cases the relay server may apply artificial intelligence to recognize when a word or phrase likely indicates confusion and similarly may forego skipping the delayed period so that CA error correction or CA captioning with error correction continues unabated. If the AU's uttered word or phrase is not associated with confusion, as described above, the CA activities (e.g., error correction or captioning and error correction) are skipped ahead to the current HU voice signal.

In still other cases, a system processor may be programmed to apply artificial intelligence to the HU voice signal as well as the AU voice signal to assess meaning of HU and AU utterances and therefore meaning of and progression of conversations as well as the AU's state of understanding. This contextual analysis can be used to assess when an AU is caught up within a conversation and can be used as a smart trigger for skipping a CA ahead within an HU voice signal to minimize CA error correction delay. For instance, the processor may be programmed to understand the meaning of an HU query "What do you think?" and an AU response "I think that works on my end. What time were you thinking?" Here, by understanding the query and response, the processor can ascertain that the AU likely understood the query and therefore is currently caught up in the conversation. In this case, the CA can be automatically skipped ahead within the HU voice signal to a current instant in the HU voice signal and ASR captions and can error correct from that point on.

In some cases there may be restrictions on text corrections that may be made by a CA. For instance, in a simple case where an AU device can only present a maximum of 50 words to an AU at a time, the system may only allow a CA to correct text corresponding to the 50 words most recently uttered by an HU. Here, the idea is that in most cases it will make no sense for a CA to waste time correcting text errors in text prior to the most recently uttered 50 words as an AU will only rarely care to back up in the record to see prior generated and corrected text. Here, the window of text that is correctable may be a function of several factors including font type and size selected by an AU on her device, the type and size of display included in an AU's device, etc. This feature of restricting CA corrections to AU viewable text is effectively a limit on how far behind CA error corrections can lag.
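The correctable window could be expressed as simply as the sketch below; the 50 word default follows the example above, while in practice the window size would be derived from the AU's font selection and display dimensions.

    def correctable_words(all_words, max_visible=50):
        # Only the words the AU can currently see remain open to CA correction;
        # words before the window are treated as out of reach for correction.
        return all_words[-max_visible:]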

In some cases it is contemplated that a call may start out with full CA error correction so that the CA considers all ASR engine generated text but that, once the error correction latency exceeds some threshold level, the CA may only be able to, or may be encouraged to, only correct low confidence text. For instance, the latency limit may be 10 seconds, at which point all ASR text is presented but low confidence text is visually distinguished in some fashion designed to encourage correction. To this end see for instance FIG. 40 where low and high confidence text is presented to a CA in different scrolling columns. In some cases error correction may be limited to the left column low confidence text as illustrated. FIG. 40 is described in more detail hereafter. Where only low confidence text can be corrected, in at least some cases the HU voice signal for the high confidence text may not be broadcast.

As another example, see FIG. 40A where a CA display screen shot 1351 includes the same text 1353 as in FIG. 40 presented in a scrolling fashion and where phrases (only one shown) that include one or some threshold number of low confidence factor words are visually distinguished (e.g., via a field border 1355, via highlighting, via different text font characteristics, etc.) to indicate the low confidence factor words and phrases. Here, in some cases the system may only broadcast the low confidence phrases, skipping from one to the next to expedite the error correction process. In other cases the system may increase the HU voice signal broadcast rate (e.g., 2x, 3x, etc.) between low confidence phrases and slow the rate down to a normal rate during low confidence phrases so that the CA continues to be able to consider low confidence phrases in full context.

In some cases, only low confidence factor text and associated HU voice signal may be presented and broadcast to a CA for consideration with some indication of missing text and voice between the presented text segments. For instance, turn piping representations (see again 216 in FIG. 17) may be presented to a CA between low confidence editable text phrases.

Referring to FIG. 40B, a series of two consecutive CA interface screen shots 2350 and 2352 are illustrated that show ASR captions corresponding to consecutive low confidence caption text for an HU voice signal. The ASR captions are shown at 2360 where it can be seen that there is a 34 second high confidence caption segment separating two caption segments 2362 and 2364 that include low confidence phrases or words. The first low confidence caption segment 2362 is presented on a CA display screen at a first time as shown at 2366 where a line of captions that includes a low confidence factor captioned word or phrase (e.g., as identified by an ASR) is located within a low confidence caption field 2368 and where high confidence caption text is presented prior to and after field 2368 to provide context. A low confidence word is visually distinguished in some fashion (e.g., highlighted, underlined, bold, etc.) within line field 2368. In the illustrated example the low confidence word is distinguished by placing that word in a low confidence word/phrase field 2370 that includes a portion of the line field 2368 as illustrated at 2370 so that a CA can quickly identify the low confidence factor word or phrase.

The second low confidence caption segment 2364 is presented on the CA display screen at a second time as shown at 2380 where a second line of captions that includes a low confidence factor captioned word or phrase (e.g., as identified by an ASR) is again located within a low confidence caption field 2368 and where high confidence caption text is presented prior to and after field 2368 to provide context. A low confidence word or phrase is again visually distinguished in some fashion (e.g., highlighted, underlined, bold, etc.) within line field 2368. In the illustrated example the low confidence word is distinguished by placing that word in a low confidence word/phrase field 2370 that includes a portion of the line field 2368 as illustrated at 2370 so that a CA can quickly identify the low confidence factor word or phrase. Here, the system would essentially skip from one low confidence word or phrase and associated caption text to the next as the CA either verifies low confidence words or phrases or replaces those words or phrases. Thus, for instance, immediately upon the CA replacing the word "ketchup" with "catch a" in FIG. 40B, the CA interface would be updated to present the screen shot shown at 2352 including the next low confidence word or phrase as well as additional surrounding captions for context.

In at least some cases when the interface transitions from presenting a first low confidence segment to a second low confidence segment, the transition may appear as a scrolling upward to simulate a sense of moving forward in time. In some cases the interface may present some indicator of the duration of HU voice signal that is not presented to the CA for error correction (e.g., in the present example, a 34 second indication). In other cases a transition may include a rapid defocusing (e.g., 1 second) of the first low confidence segment and refocusing where the second low confidence factor segment is presented. This may be particularly useful in cases where fields 2368 and 2370 remain stationary while caption segments are replaced.

Again, in alternative systems, all ASR text may be presented to the CA and all HU voice signal may be broadcast to the CA where high confidence factor words and phrases are presented at an increased speed and in a normal non-distinguished way and where low confidence factor captions are distinguished and the associated voice signal is broadcast at the speaking speed of the HU.

In some embodiments, as illustrated in FIG. 40B, the low confidence line field 2368 and word/phrase field 2370 may be stationary as the system moves from one low confidence caption segment to another. To this end, see that the field 2370 and field 2368 remain at the exact same locations in interfaces 2366 and 2380 despite different caption text being presented. The idea here is that all low confidence words or phrases will appear, one at a time, within the low confidence word/phrase field 2370 with contextual high confidence text therearound so that a CA can simply focus on the one field 2370 to view consecutive low confidence captioned words and phrases, which should reduce eye and other physical strain on CAs.

Referring still to FIG. 40B, a CA workstation processor also generates correction options for a CA for each low confidence word or phrase presented for consideration. In this regard, see that a pop up window 2372 is presented as part of interface 2350 and presents selectable icons corresponding to different correction and verification options. The first illustrated option can be selected via any type of CA user interface device to confirm that the current caption text is correct and the other options are each selectable to replace the caption text in field 2368 with the selected option. In some embodiments the originally presented caption text in field 2368 is the best of all options and other options in field 2372 are presented in an order where the second most likely option is presented second, the third best option third, etc.

When the OK-Next option is selected, the processor immediately skips ahead to the next low confidence factor word or phrase and presents that word or phrase in field 2370 with surrounding text (see 2380) before and after for context as shown in FIG. 40B at 2352. In the present example it should be appreciated that by skipping ahead to the next low confidence factor captions for error correction, the system simply skips the 34 seconds of high confidence factor captions and HU voice signal which expedites the caption correction process and also reduces CA stress appreciably.

Referring again to FIG. 40B and specifically to screen interface 2350, if any of the second through fourth replacement caption options in field 2372 is selected, the caption text in field 2370 is replaced by the selected option and that error correction is transmitted to a connected AU captioned device to drive error corrections to the captions displayed by that device. In addition, again, the processor immediately skips ahead to the next low confidence factor word or phrase and presents that word or phrase in field 2370 with surrounding text before and after for context as shown in FIG. 40B at 2352. Again, when the system skips to the next low confidence word or phrase, the fields 2368 and 2370 do not move and instead, the text segment(s) that include the low confidence factor word or phrase are formatted so that the low confidence word or phrase can be placed in field 2370 with contextual text surrounding that word or phrase.

Referring still to FIG. 40B, in at least some embodiments a CA also has the ability to modify any text within a presented phrase including text within field 2370. Where a CA changes text in field 2370, once that change is entered, again, the processor immediately identifies and presents the caption segment(s) that includes the next low confidence factor word or phrase.

As in any interface that requires repetitive activity, any way to minimize the user burden required to comprehend system output and provide user input is important. As described above, one way to reduce CA burden related to comprehending system output is in the way captions are presented to the CA for error correction. Again, by freezing fields 2368 and 2370 and simply populating those fields with consecutive low confidence factor captions, the burden of shifting sight trajectory around on the display screen is lessened.

To lessen the burden related to CA input to the system, one feature already described includes the error correction options field 2372 (see again FIG. 40B) where error correction options are presented to the CA for any low confidence words or phrases. In at least some embodiments the options field 2372, like fields 2368 and 2370, may be stationary and simply repopulated with error correction options for each low confidence factor phrase presented in field 2370. In at least some cases the options in field 2372 may be selected via an on screen cursor associated with a mouse input device 2365. In other cases, each option in field 2372 may be labelled with a number which is selectable via a keyboard device 2361 or via a microphone included in a CA headset 2363. Thus, for instance, a CA may utter "three" while the word "ketchup" is in field 2368 to replace that word with the phrase "catch a". In still other cases the CA may utter the suggested replacement option (e.g., "catch a") to initiate a replacement.

Referring still to FIG. 40B, once a low confidence word or phrase is presented in field 2368, a system processor may be programmed to assume that any input from a CA is to be applied to the word or phrase in field 2368 so that there is no need to manually select field 2368 prior to changing the caption in the field or affirming that the caption in the field is accurate. Thus, for example, while viewing the screen shot shown at 2350 in FIG. 40B, if a CA wants to type in replacement text for the word "ketchup", she may simply start typing without having to select the word "ketchup" first. Similarly, subsequently, while viewing the screen shot shown at 2352 in FIG. 40B, if a CA wants to type in a replacement for the phrase "warp gown", she may simply start typing without having to select that phrase first. In other cases where a CA can voice a replacement word or phrase and software then generates replacement text, while the CA is viewing the screen shot shown at 2350 in FIG. 40B, if the CA wants to replace the word "ketchup", she may simply voice the replacement word or phrase without having to select the word "ketchup" first.

Thus, in one optimized CA interface, a CA may simply view consecutive low confidence factor captions one at a time where low confidence words and phrases are highlighted one at a time and where the highlighting of the word or phrase operates as an automatic selection thereof for error correction so that no CA selection step or action is required. By eliminating the phrase selection process, physical stress on a CA can be substantially reduced.

It should be appreciated that even in a system where initial selection of consecutive low confidence factor words and phrases is automated, a CA may be able to manually select any word or phrase presented on a display screen via a cursor, touch, or the like, so that any text, even high confidence text, can be edited or replaced as desired. Once a CA completes replacing a high confidence factor word or phrase, the system may be programmed to revert back to skipping from one low confidence factor phrase to another as described above.

Referring now to FIG. 40C, yet another exemplary CA interface screen shot 2600 and exemplary CA input devices are illustrated where the interface includes several other advantageous content display features and arrangements. To this end, ASR captioned text associated with an HU voice signal is presented in a scrolling fashion at 2602 where low confidence factor words and phrases are highlighted or otherwise visually distinguished at 2608, 2612, 2614 and 2616 in a first way. A text word currently being broadcast to a CA is visually distinguished in a second way 2606 and that word and other text within a line of text including that word are located within a stationary text line field 2604 (e.g., the captions scroll upward so that the text line including a currently broadcast word is always located within field 2604 so that a CA can primarily concentrate sight at the same vertical location on the screen to view currently broadcast text).

Referring still to FIG. 40C, the caption text is presented in a right justified arrangement (see phantom line 2615) and is arranged so that every low confidence word or phrase is presented along the right justified line which makes it easy for a CA to locate low confidence caption words and phrases as the captions scroll upward. Correction options for any low confidence word or phrase at the right end of field 2604 are presented in an options field or window 2610.

Referring still to FIG. 40C, a CA may select any of the presented text in at least some embodiments to make changes. A CA may use a mouse 2365, keyboard 2361 or headset 2362 to make changes in any of the manners described above. In at least some cases a CA listening to an HU voice signal broadcast may simply ignore any of the low confidence factor indicators allowing current words and phrases to pass off the screen. When a phrase or word passes off the screen without amendment, that word or phrase may become firm. Thus, in some cases a CA's inaction is used as confirmation that current text captions are accurate.

In other cases, while interim and final ASR engine text may be presented to an AU, a CA may only see final ASR engine text and therefore only be able to edit that text. Here, the idea is that most of the time ASR engine corrections will be accurate and therefore, by delaying CA viewing until final ASR engine text is generated, the number of required CA corrections will be reduced appreciably. It is expected that this solution will become more advantageous as ASR engine speed increases so that there is minimal delay between interim and final ASR engine text representations.

In still other cases it is contemplated that only final ASR engine text may be sent on to an AU for consideration. In this case, for instance, ASR generated text may be transmitted to an AU device in blocks where context afforded by surrounding words has already been used to refine text hypotheses. For instance, words may be sent in five word text blocks where the block sent always includes the 6th through 10th most recently transcribed words so that the most recent through fifth most recent words can be used contextually to generate final text hypotheses for the 6th through 10th most recent words. Here, CA text corrections would still be made at a relay and transmitted to the AU device for in line corrections of the ASR engine final text.

In this case, if a CA takes over the task of text generation from an ASR engine for some reason (e.g., an AU requests CA help), the system may switch over to transmitting CA generated text word by word as the text is generated. In this case CA corrections would again be transmitted separately to the AU device for in line correction. Here, the idea is that the CA generated text should be relatively more accurate than the ASR engine generated text and therefore immediate transmission of the CA generated text to the AU would result in a lower error presentation to the AU.

While not shown, in at least some embodiments it is contemplated thatturn piping type indications may be presented to a CA on her interfacedisplay as a representation of the delay between the CA text generationor correction and the ASR engine generated text. To this end, see theexemplary turn piping 216 in FIG. 17. A similar representation may bepresented to a CA.

Where CA corrections or even CA generated text is substantially delayed, in at least some cases the system may automatically force a split to cause an ASR engine to catch up to a current time in a call and to firm up (e.g., disable a CA from changing the text) text before the split time. In addition, the system may identify a preferred split prior to which ASR engine confidence factors are high. For instance, where ASR engine text confidence factors for spoken words prior to the most recent 15 words are high and for the last fifteen words are low, the system may automatically suggest or implement a split at the 15th most recent word so that ASR text prior to that word is firmed up and text thereafter is still presented to the CA to be considered and corrected. Here, the CA may reject the split either by selecting a rejection option or by ignoring the suggestion or may accept the suggestion by selecting an accept option or by ignoring the suggestion (e.g., where the split is automatic if not rejected in some period (e.g., 2 seconds)). To this end, see the exemplary CA screen shot in FIG. 39 where ASR generated text is shown at 1332. In this case, the CA is behind in error correction so that the CA computer is currently broadcasting the word "want" as indicated by the "Broadcast" tag 1334 that moves along the ASR generated text string to indicate to the CA where the current broadcast point is located within the overall string. A "High CF-Catch Up" tag 1338 is provided to indicate a point within the overall ASR text string presented prior to which ASR confidence factors are high and after which ASR confidence factors are relatively lower. Here, it is contemplated that a CA would be able to select tag 1338 to skip to the tagged point within the text. If a CA selects tag 1338, the broadcast may skip to the associated tagged point so that "Broadcast" tag 1334 would be immediately moved to the point tagged by tag 1338 where the HU voice broadcast would recommence. In other cases, selecting high confidence tag 1338 may cause accelerated broadcast of text between tags 1334 and 1338 to expedite catch up.
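A sketch of how a preferred split point might be located from per-word confidence factors is shown below; the 0.75 cutoff and 15 word tail are assumed example values chosen only to mirror the scenario described above.

    def suggest_split_index(confidences, low_cf=0.75, tail=15):
        # Suggest a split when confidence factors for words prior to the most
        # recent `tail` words are high while the last `tail` words are low, so
        # earlier text can be firmed up and the CA resumes correcting at the split.
        if len(confidences) <= tail:
            return None
        head, recent = confidences[:-tail], confidences[-tail:]
        if min(head) >= low_cf and max(recent) < low_cf:
            return len(confidences) - tail
        return None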

Referring to FIG. 40, another exemplary CA screen shot 1333 that may be presented to show low and high confidence text segments and to enable a CA to skip to low confidence text and associated voice signal is illustrated. Screen shot 1333 divides text into two columns including a low confidence column 1335 and a high confidence column 1337. Low confidence column 1335 includes text segments that have ASR assigned confidence factors that are less than some threshold value while high confidence column 1337 includes text segments that have ASR assigned confidence factors that are greater than the threshold value. Column 1335 is presented on the left half of screen shot 1333 and column 1337 is presented on the right half of shot 1333. The two columns would scroll upward simultaneously as more text is generated. Again, a current broadcast tag 1339 is provided at a current broadcast point in the presented text. Also, a "High CF, Catch Up" tag 1341 is presented at the beginning of a low confidence text segment. Here, again, it is contemplated that a CA may select the high confidence tag 1341 to skip the broadcast forward to the associated point to expedite the error correction process. As shown, in at least some cases, if the CA does not skip ahead by selecting tag 1341, the HU voice broadcast may be at 2x or more the speaking speed so that catch up can be more rapid.

In at least some cases it is contemplated that when a call is received at an AU device or at a relay, a system processor may use the calling number (e.g., the number associated with the calling party or the calling party's device) to identify the least expensive good option for generating text for a specific call. For instance, for a specific first caller, a robust and reliable ASR engine voice model may already exist and therefore be useable to generate automated text without the need for CA involvement most of the time, while no model may exist for a second caller that has not previously used the system. In this case, the system may automatically initiate captioning using the ASR engine and first caller voice model for first caller calls and may automatically initiate CA assisted captioning for second caller calls so that a voice model for the second caller can be developed for subsequent use. Where the received call is from an AU and is outgoing to an HU, a similar analysis of the target HU may cause the system to initiate ASR engine captioning or CA assisted captioning.
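As one hypothetical illustration of this number based selection, the lookup below assumes a mapping of phone numbers to stored voice models; the function name and return values are placeholders, not a required interface.

    def initial_caption_mode(caller_number, voice_models):
        # voice_models is a hypothetical mapping of phone number -> stored ASR
        # voice model.  If a robust model exists, start in ASR engine mode;
        # otherwise start with CA assisted captioning so a model can be built
        # for subsequent calls from the same number.
        if caller_number in voice_models:
            return "asr_engine", voice_models[caller_number]
        return "ca_assisted", None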

In some embodiments the identity of an AU (e.g., an AU's phone number or other communication address) may also be used to select which of two or more text generation options to use to at least initiate captioning. Thus, some AUs may routinely request CA assistance on all calls while others may prefer all calls to be initiated as ASR engine calls (e.g., for privacy purposes) where CA assistance is only needed upon request for relatively small sub-periods of some calls. Here, AU phone or address numbers may be used to assess optimal captioning type.

In still other cases both a called and a calling number may be used to assess optimal captioning type. Here, in some cases, an AU number or address may trump an HU number or address and the HU number or address may only be used to assess caption type to use initially when the AU has no perceived or expressed preference.

Referring again to FIG. 39, it has been recognized that, in addition to text corresponding to an HU voice signal, an optimal CA interface needs additional information that is related to specific locations within a presented text string. For instance, specific virtual control buttons need to be associated with specific text string locations. For example, see the "High CF-Catch Up" button in FIG. 39. As other examples, a "resume" tag 1233 as in FIG. 36 or a correction word (see FIG. 20) may need to be linked to a specific text location. As another instance, in some cases a "broadcast" tag indicating the word currently being broadcast may have to be linked to a specific text location (see FIG. 39).

In at least some embodiments, a CA interface or even an AU interface will take a form where text lines are separated by at least one blank line that operates as an "additional information" field in which other text location linked information or content can be presented. To this end, see FIG. 39 where additional information fields are collectively labelled 1215. In other embodiments it is contemplated that the additional information fields may also be provided below associated text lines. In still other embodiments, other text fields may be presented as separate in line fields within the text strings (see 1217 in FIG. 40).

Training, Gamification, CA Scoring, CA Profiles

In many industries it has been recognized that if a tedious job can be gamified, employee performance can be increased appreciably as employees work through obstacles to increase personal speed and accuracy scores and, in some cases, to compete with each other. Here, in addition to increased personal performance, an employing entity can develop insights into best work practices that can be rolled out to other employees attempting to better their performance. In addition, where there are clear differences in CA capabilities under different sets of circumstances, CA scoring can be used to develop CA profiles so that when circumstances can be used to distinguish optimal CAs for specific calls, an automated system can distribute incoming calls to optimal CAs for those specific calls or can move calls among CAs mid-call so that the best CA for each call or parts of calls can be employed.

In the present case, various systems are being designed and tested to add gamification, scoring and profile generating aspects to the text captioning and/or correction processes performed by CAs. In this regard, in some cases it has been recognized that if a CA simply operates in parallel with an ASR engine to generate text, a CA may be tempted to simply let the ASR engine generate text without diligent error correction which, obviously, is not optimal for AUs receiving system generated text where caption accuracy is desired and even required to be at high levels.

To avoid CAs shirking their error correction responsibilities and to help CAs increase their skills, in at least some embodiments it is contemplated that a system processor that drives or is associated with a CA interface may introduce periodic and random known errors into ASR generated text that is presented to a CA as test errors. Here, the idea is that a CA should identify the test errors and at least attempt to make corrections thereto. In most cases, while errors are presented to the CA, the errors are not presented to an AU and instead the likely correct ASR engine text is presented to the AU. In some cases the system allows a CA to actually correct the erroneous text without knowing which errors are ASR generated and which are purposefully introduced as part of one of the gamification or scoring processes. Here, by requiring the CA to make the correction, the system can generate metrics on how quickly the CA can identify and correct caption errors.

In other cases, when a CA selects an introduced text error to make a correction, the interface may automatically make the correction upon selection so that the CA does not waste additional time rendering a correction. In some cases, when an introduced error is corrected either by the interface or the CA, a message may be presented to the CA indicating that the error was a purposefully introduced error.

Referring to FIG. 41, a method 1350 that is consistent with at least some aspects of the present disclosure for introducing errors into an ASR text stream for testing CA alertness is illustrated. At block 1352, an ASR engine generates ASR text segments corresponding to an HU voice signal. At block 1354, a relay processor or ASR engine assigns confidence factors to the ASR text and at block 1356, the relay identifies at least one high confidence text segment as a "test" segment. At block 1358, the processor transmits the high confidence test segment to an AU device for display to an AU. At block 1360, the processor identifies an error segment to be swapped into the ASR generated text for the test segment to be presented to the CA. For instance, where a high confidence test segment includes the phrase "John came home on Friday", the processor may generate an exemplary error segment like "John camp home on Friday".

Referring still to FIG. 41, at block 1362, the processor presents text with the error segment to the CA as part of an ongoing text stream to consider for error correction. At decision block 1364, the processor monitors for CA selection of words or phrases in the error segment to be corrected. Where the CA does not select the error segment for correction, control passes to block 1372 where the processor stores an indication that the error segment was not identified and control passes back up to block 1352 where the process continues to cycle. In addition, at block 1372, the processor may also store the test segment, the error segment and a voice clip corresponding to the test segment that may later be accessed by the CA or an administrator to confirm the missed error.

Referring again to block 1364 in FIG. 41, if the CA selects the error segment for correction, control passes to block 1366 where the processor automatically replaces the error segment with the test segment so that the CA does not have to correct the error segment. Here the test segment may be highlighted or otherwise visually distinguished so that the CA can see the correction made. In addition, in at least some cases, at block 1368, the processor provides confirmation that the error segment was purposefully introduced and corrected. To this end, see the "Introduced Error-Now Corrected" tag 1331 in FIG. 39 that may be presented after a CA selects an error segment. At block 1370, the processor stores an indication that the error segment was identified by the CA. Again, in some cases, the test segment, error segment and related voice clip may be stored to memorialize the error correction. After block 1370, control passes back up to block 1352 where the process continues to cycle.
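The error introduction step of this method could be sketched roughly as follows; the word mutation used here is a trivial placeholder for a real error generator, and the 0.95 confidence cutoff and 2% introduction rate are assumed example values.

    import random

    def maybe_introduce_test_error(segment, confidence, high_cf=0.95, rate=0.02):
        # Occasionally swap a high confidence ASR test segment for a deliberately
        # altered error segment shown only to the CA (the AU still receives the
        # true segment).  Returns (text_for_ca, true_text_or_None); a non-None
        # true text marks the segment as a purposefully introduced test error.
        words = segment.split()
        if not words or confidence < high_cf or random.random() > rate:
            return segment, None
        i = random.randrange(len(words))
        words[i] = words[i][:-1] + "p"  # placeholder mutation, e.g., "came" -> "camp"
        return " ".join(words), segment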

In some cases errors may only be introduced during periods when the rate of actual ASR engine errors and CA corrections is low. For instance, where a CA is routinely making error corrections during a one minute period, it would make no sense to introduce more text errors as the CA is most likely highly focused during that period and her attention is needed to ensure accurate error correction. In addition, if a CA is substantially delayed in making corrections, the system may again opt to not introduce more errors.

Error introductions may include text additions, text deletions (e.g., removal of text so that the text is actually missing from the transcript) and text substitutions in some embodiments. In at least some cases the error generating processor or CA interface may randomly generate errors of any type and related to any ASR generated text. In other cases, the processor may be programmed to introduce several different types of errors including visible errors (e.g., defined above as errors that are clear errors when placed in context with other words in a text phrase, e.g., the phrase does not make sense when the erroneous text is included), invisible errors (e.g., errors that make sense and are grammatically correct in the context of surrounding words), minor errors which are errors that, while including incorrect text, have no bearing on the meaning of an associated phrase (e.g., "the" swapped for "a") and major errors which are errors that include incorrect text and that change the meaning of an associated phrase (e.g., swapping a 5 PM meeting time for a 3 PM meeting time). In some cases an error may have two designations such as, for instance, visible and major, visible and minor, invisible and major or invisible and minor.

Because at least some ASR engines can understand context, the engines can also be programmed to ascertain when a simple text error affects phrase meaning and can therefore generate and identify different error types to test a CA's correction skills. For instance, in some cases introduced errors may include visible, invisible, minor and major errors and statistics related to correcting each error type may be maintained as well as when a correction results in a different error. For instance, an invisible major error may be presented to a CA and the CA may recognize that error and incorrectly correct it to introduce a visible minor error which, while still wrong, is better than the invisible major error. Here, statistics would reflect that the CA identified and corrected the invisible major error but made an error when correcting which resulted in a visible minor error. As another instance, a visible minor error may be incorrectly corrected to introduce an invisible major error which would generate a much worse captioning result that could have substantial consequences. Here, statistics would reflect that the CA identified and corrected the initial error, which is good, but would also reflect that the correction made introduced another error and that the new error resulted in a worse transcription result.

In some embodiments gamification can be enhanced by generating ongoing, real time dynamic scores for CA performance including, for instance, a score associated with accuracy, a separate score associated with captioning speed and/or separate speed and accuracy scores under different circumstances such as, for instance, for male and female voices, for east coast accents, Midwest accents, southern accents, etc., for high speed talking and slower speed talking, for captioning with correcting versus captioning alone versus correcting ASR engine text, and any combinations of factors that can be discerned. In FIG. 40, exemplary accuracy and speed scores that are updated in real time for an ongoing call are shown at 1343 and 1345, respectively. Where a call persists for a long time, a rolling most recent sub-period of the call may be used as a duration over which at least current scores are calculated and separate scores associated with an entire call may be generated and stored as well.

CA scores may be stored as part of a CA profile and that profile may be routinely updated to reflect growing CA effectiveness with experience over time. Once CA specific scores are stored in a CA profile, the system may automatically route future calls that have characteristics that match high scores for a specific CA to that CA, which should increase overall system accuracy and speed. Thus, for instance, if an HU profile associated with a specific phone number indicates that an associated HU has a strong southern accent and speaks rapidly, when a call is received that is associated with that phone number, the system may automatically route the call to a CA that has a high gamification score for rapid southern accents if such a CA is available to take the call. In other cases it is contemplated that when a call is received at a relay where the call cannot be associated with an existing HU voice profile, the system may assign the call to a first CA to commence captioning where a relay processor analyzes the HU voice during the beginning of the call and identifies voice characteristics (e.g., rapid, southern, male, etc.) and automatically switches the call to a second CA that is associated with a high gamification score for the specific type of HU voice. In this case, speed and accuracy would be expected to increase after the switch to the second CA.

Similarly, if a call is routed to one CA based on an incoming phone number and it turns out that a different HU voice is present on the call so that a different voice profile better fits the HU voice, the call may be switched from an initial CA to a different CA that is more optimal for the HU voice signal. In some cases a CA switch mid-call may only occur if some threshold level of delay or captioning errors is detected. For instance, if a first assigned CA's delay and error rate is greater than threshold values and a system processor recognizes HU voice characteristics that are much better suited to a second available CA's skill set and profile, the system may automatically transition the call from the first CA to the second CA.

In addition, in some cases it is contemplated that in addition to the individual speed and accuracy scores, a combined speed/accuracy score can be generated for each CA over the course of time, for each CA over a work period (e.g., a 6 hour captioning day), for each CA for each call that the CA handles, etc. For example, an exemplary single score algorithm may include a running tally that adds one point for a correct word and adds zero points for an incorrect word, where the correct word point is offset by an amount corresponding to a delay in word generation after some minimal threshold period (e.g., 2 seconds after the word is broadcast to the CA for transcription or one second after the word is broadcast to and presented to a CA for correction). For instance, the offset may be 0.2 points for every second after the minimal threshold period. Other algorithms are contemplated. The single score may be presented to a CA dynamically and in real time so that the CA is motivated to focus more. In other cases the single score per phone call may be presented at the end of each call or an average score over a work period may be presented at the end of the work period. In FIG. 40, an exemplary current combined score is shown at 1347.
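A worked sketch of the example tally above follows; clamping each word's contribution at zero is an assumption the text does not specify, and the 2 second threshold and 0.2 point per second offset simply restate the example values.

    def combined_score(word_events, threshold_seconds=2.0, penalty_per_second=0.2):
        # word_events is a list of (correct: bool, delay_seconds: float) tuples.
        # Each correct word adds one point, reduced by 0.2 for every second of
        # delay beyond the threshold; incorrect words add zero points.
        score = 0.0
        for correct, delay in word_events:
            if not correct:
                continue
            offset = max(0.0, delay - threshold_seconds) * penalty_per_second
            score += max(0.0, 1.0 - offset)
        return score

    # Example: two correct words (one delayed 4 s) and one incorrect word
    # yields 1.0 + 0.6 + 0.0 = 1.6 points.
    print(combined_score([(True, 1.0), (True, 4.0), (False, 0.5)]))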

The single score or any of the contemplated metrics may also be related to other factors such as, for instance:

(1) How quickly errors are corrected by a CA;

(2) How many ASR errors need to be corrected in a rolling period of time;

(3) ASR delays;

(4) How many manufactured or purposefully introduced errors are caught and corrected;

(5) Error types (e.g., visible, invisible, minor and major);

(6) Correct and incorrect corrections;

(7) Effect of incorrect corrections and non-corrections (e.g., better caption or worse caption);

(8) Rates of different types of corrections;

(9) Error density;

(10) Once a CA is behind, how does the CA respond, rate of catchup;

(11) HU speaking rate (WPM);

(12) HU accent or dialect;

(13) HU volume, pitch, tone, changes in audible signal characteristics;

(14) Voice signal clarity (perhaps as measured by the ASR engine);

(15) Communication link quality;

(16) Noise level (e.g., HU operating in a high wind environment where noise is substantial and persistent);

(17) Quality of captioned sentence structure (e.g., verb, noun, adverb, in acceptable sequence);

(18) ASR confidence factors associated with text generated during a call (as a proxy for captioning complexity), etc.

In at least some embodiments where gamification and training processes are applied to actual AU-HU calls, there may be restrictions on the ability to store captions of actual conversations. Nevertheless, in these cases, captioning statistics may still be archived without saving caption text and the statistics may be used to drive scoring and gamification routines. For instance, for each call, call characteristics may be stored including, for instance, HU accent, average HU voice signal rate, highest HU voice signal rate, average volume of HU voice signal, other voice signal defining parameters, communication line clarity or other line characteristics, etc. (e.g., any of the other factors listed above). In addition, CA timing information may be stored for each audio segment in the call, for captioned words and for corrective CA activities.

As in the case of the full or pure CA metrics testing and development system described above, in at least some cases real AU-HU calls may be replaced by pre-recorded test call data sets where audio is presented to a CA while mock ASR engine text associated therewith is visually presented to the CA for correction. In at least some cases, the pre-stored test data set may only include a mocked up HU voice signal and known correct or true text associated therewith and the system including an ASR engine may operate in a normal fashion so the ASR engine generates real time text including ASR errors for the mocked up HU voice signal as a CA views that ASR text and makes corrections. Here, as the CA generates corrected final text, a system processor may automatically compare that text to the known correct or true text to generate CA call metrics including various scoring values.
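
By way of illustration only, the following Python sketch shows one way a system processor could compare CA corrected final text against the known correct or true text of a test data set to produce per-call scoring values. The listing uses a simple word-level alignment from the standard difflib module; the function name and the metric names are hypothetical.

    import difflib

    def score_against_truth(truth_text, ca_text):
        truth = truth_text.lower().split()
        ca = ca_text.lower().split()
        matcher = difflib.SequenceMatcher(a=truth, b=ca)
        errors = {"substituted": 0, "missing": 0, "extra": 0}
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "replace":
                errors["substituted"] += max(i2 - i1, j2 - j1)
            elif op == "delete":
                errors["missing"] += i2 - i1   # truth words the CA never captioned
            elif op == "insert":
                errors["extra"] += j2 - j1     # words with no counterpart in truth
        accuracy = 1.0 - sum(errors.values()) / max(1, len(truth))
        return {"accuracy": round(accuracy, 3), **errors}

    print(score_against_truth("we may go out and catch a movie tonight",
                              "we may go out and watch a movie"))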

In other cases, the ASR engine functions may be mimicked by a system processor that automatically introduces known errors of specific types into the correct or true text associated with the mocked up HU voice signal to generate mocked up ASR text that is presented to a CA for correction. Here, again, as the CA generates corrected final text, a system processor automatically compares that text to the known true text to generate CA call metrics including various scoring values.
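
Again for illustration only, the sketch below mimics the error-introducing processor described above by planting known substitution and dropped-word errors into the true text to produce mock ASR text for a CA to correct. The error types, rates and the crude substitution rule are hypothetical.

    import random

    def inject_errors(true_words, sub_rate=0.05, drop_rate=0.02, seed=0):
        rng = random.Random(seed)
        mock, planted = [], []
        for idx, word in enumerate(true_words):
            roll = rng.random()
            if roll < drop_rate:
                planted.append((idx, word, None, "dropped"))
                continue                                  # simulate a missing word
            if roll < drop_rate + sub_rate:
                bad = word[::-1] if len(word) > 2 else word + "s"  # crude substitution
                mock.append(bad)
                planted.append((idx, word, bad, "substituted"))
                continue
            mock.append(word)
        return mock, planted  # planted tells the scorer where the known errors are

    mock_text, planted = inject_errors("you should bring the cards along".split())
    print(" ".join(mock_text), planted)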

In still other cases, in addition to storing the test HU voice signal and associated true text, the system may also store a test version of text associated with the HU voice signal where the test text version has known errors of known types and, during a test session, the test text with errors may be presented to the CA for correction. Here, again, as the CA generates corrected final text, a system processor automatically compares that text to the known true text to generate CA call metrics including various scoring values.

In each case where a mocked up HU voice signal is used during a test session, the voice signal and CA captioned transcripts can be maintained and correlated with the CA's results so that the CA and/or a system administrator can review those results for additional scoring purposes or to identify other insights into a specific CA's strengths and weaknesses or into CA activities more generally.

In at least some cases CAs may be tested using a testing application that, in addition to generating mock ASR text and ASR corrections for a mocked up AU-HU voice call, also simulates other exemplary and common AU actions during the call such as, for instance, switching from an ASR-CA backed up mode to a full CA captioning and error correction mode. Here, as during a normal call, the CA would listen to HU voice signal and see ASR generated text on her CA display screen and would edit perceived errors in the ASR text during the ASR-CA backed up mode operation. Here, the CA would have full functionality to skip around within the ASR generated text to rebroadcast HU segments during error correction, to firm up ASR text, etc., just as if the mocked up call were real. At some point, the testing application would then issue a command to the CA station indicating that the AU requires full CA captioning and correction without ASR assistance at which point the CA system would switch over to full CA captioning and correction mode. A switch back to the ASR-CA backed up mode may occur subsequently.

Where pre-recorded mock HU voice signals are fed to a CA, a Truth/Scorer processor may be programmed to automatically use known HU voice signal text to evaluate CA corrections for accuracy as described above. Here, a final draft of the CA corrected text may be stored for subsequent viewing and analysis by a system administrator or by the CA to assess effectiveness, timing, etc.

Where scoring is to be applied to a live AU-HU call that does not use a pre-recorded HU voice signal so there is no initial "true" text transcript, a system akin to one of those described above with respect to one of FIG. 30 or 31 may be employed where a "truth" transcript is generated either via another CA or an ASR or a CA correcting ASR generated text for comparison and scoring purposes. Here, the second CA that generates the truth transcript may operate at a much slower pace than the pace required to support an AU as caption rate is not as important and can be sacrificed for accuracy. Once or as the second CA generates the truth transcript, a system processor may compare the first CA captioning results to the truth transcript to identify errors and generate statistics related thereto. Here, the truth transcript is ultimately deleted so that there is no record of the call and all that persists is statistics related to the CA's performance in handling the call.

In other embodiments where scoring is applied to a live AU-HU call that does not have a predetermined "truth" transcript, the second CA may receive the first CA's corrected text and listen to the HU voice signal while correcting the first CA's corrected text a second time. In this case, a processor tracks corrections by the first CA as well as statistics related to one or any subset of the call factors (e.g., rate of speech, number of ASR text errors per some number of words, etc.) listed above. In addition, the processor tracks corrections by the second CA where the second CA corrections are considered the truth transcript. Thus, any correction made by the second CA is taken as an error.

In at least some cases, instead of just identifying CA caption errors generally, either a system processor or a second CA/scorer may categorize each error as visible (e.g., in the context of the phrase, the error makes no sense), invisible (e.g., in the context of the phrase the error makes sense but the meaning of the phrase changes) or minor (e.g., an error that does not change the meaning of the including phrase). Where a scoring second CA has to identify error type in a case where a mock AU-HU call is used as the source for CA correction, a processor may present a screenshot to the second CA where all errors are identified as well as tallying tools for adding each error to one of several error type buckets.

To this end, see FIG. 51 where an exemplary CA scoring screen shot 1568 is illustrated. The screen shot 1568 includes a CA text transcript at 1572 that includes corrections by a first CA that is being scored by a CA scorer (e.g., a system manager or administrator). While scoring the text, the scorer listens to the HU voice signal via a headset and, in at least some cases, a word associated with a currently broadcast HU voice signal is highlighted to aid the scorer in following along. In the illustrated embodiment, a system processor compares the CA corrected text to a truth transcript and identifies transcription errors. Each error in FIG. 51 is visually distinguished. For instance, see exemplary field indicators 1574, 1576, 1578 and 1580, each of which represents an error.

Referring still to FIG. 51, as the scorer works her way through the CA text transcript considering each error, the scorer uses judgement to determine if the error is a major error or a minor error and designates each error either major or minor. For instance, a scorer may use a mouse or touch to select each error and then use specific keyboard keys to assign different error types to each error. In the illustrated example, a "V" keyboard selection designates an error as a major error while an "F" selection designates the error as a minor error. In FIG. 51, each time an error type is assigned to an error, a V1 or F1 designator is spatially associated with the error on the screen shot 1568 so that the error type is clear. In addition, when an error type is assigned to an error, the designated error is visually distinguished in a different fashion to help the scorer track which errors have been characterized and which have not. For instance, in FIG. 51, fields 1574 and 1576 are shown as left up to right cross hatched to indicate a red color indicating that associated errors have been categorized while fields 1578 and 1580 are shown left down to right cross hatched to indicate a blue color reserved for errors that have yet to be considered and categorized by the scorer.

In addition, when an error type is assigned to an error, a counter associated with the error type is incremented to indicate a total count for that specific type of error. To this end, a counter field 1570 is presented along the top edge of the screen shot 1568 that includes several counters including a major error counter and a minor error counter at 1598 and 1600, respectively. The final counts are used to generate various metrics related to CA quality and effectiveness.
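
For illustration only, the following sketch captures the tallying behavior described above: a "V" key press categorizes the currently selected error as major, an "F" key press categorizes it as minor, and the matching counter is incremented. The class and method names are hypothetical and the user interface plumbing is omitted.

    class ErrorTally:
        def __init__(self):
            self.counts = {"major": 0, "minor": 0}
            self.categorized = {}  # error field id -> assigned category

        def key_press(self, error_id, key):
            category = {"V": "major", "F": "minor"}.get(key.upper())
            if category is None or error_id in self.categorized:
                return None  # unknown key, or error already categorized
            self.categorized[error_id] = category
            self.counts[category] += 1
            return category

    tally = ErrorTally()
    for error_id, key in [(1574, "V"), (1576, "V"), (1578, "F")]:
        tally.key_press(error_id, key)
    print(tally.counts)  # {'major': 2, 'minor': 1}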

In at least some cases a scorer may be able to select an error field to access associated text from the truth transcript that is associated with the error. To this end, see in FIG. 51 where hand icon 1594 indicates user selection of error field 1578 which opens up truth text field 1596 in which associated truth text is presented. In the example, the name "Jane" is the truth text for the error "Jam". Thus, the scorer can either listen to the broadcast voice or view truth text to compare to error text for assessing error type.

Referring still to FIG. 51, missing text is also an error and is represented by the term "% missing" as shown at 1580. Here, again, the scorer can select the missing text field to view truth text associated therewith in at least some embodiments.

A "non-error" is erroneous text that could not possibly be confusing to someone reading a caption. For instance, exemplary non-errors include alternate spellings of a word, punctuation, spelled out numbers instead of numerals, etc. Here, while the system may flag non-errors between a truth text and CA generated text, the scorer may un-flag those errors as they are effectively meaningless. The idea here is that, on balance, it is better to have faster captioning with some non-errors than slower captioning where there are no non-errors and therefore, at a minimum, CAs should not be penalized for purposefully or even unintentionally allowing non-errors. When a scorer un-flags a non-error, the appearance of the non-error is changed so that it is not visually distinguished from other correct text in at least some embodiments. In addition, when a scorer un-flags a non-error, a value in a non-error count field 1602 is incremented by one.

In at least some cases a scorer can highlight a word or phrase in a text caption causing a processor to indicate durations of silence prior to the selected word or each word in a selected phrase. To this end, see, for instance, the highlighted phrase "may go out and catch a movie" in FIG. 51 where pre-word delays are shown before each word in the highlighted phrase including, for instance, delays 1605 and 1607 corresponding to the words "may" and "go", respectively. Here, a scorer can use the delays to develop a sense of whether or not words repeated in CA corrected text are meaningful. For instance, where a CA corrected transcript includes the phrase "no no", whether this word duplication is meaningful may depend upon the delay between the two words. For instance, where there is no delay between the words, the duplication was not necessary as one "no" would have gotten the meaning across. On the other hand, where there is a several second delay between the first and second "no" utterances in the HU voice signal, that indicates that each word was a separate answer (e.g., the end of one sentence and the beginning of another). A scorer can use this type of information as another metric for scoring CA performance.
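
For illustration only, the following sketch computes the pre-word delays described above from per-word time stamps so that a scorer (or a processor) can judge whether a repeated word such as "no no" reflects one answer or two. The data shape is a hypothetical assumption.

    def pre_word_delays(word_times):
        # word_times: list of (word, start_seconds, end_seconds) in broadcast order
        delays, prev_end = [], None
        for word, start, end in word_times:
            delay = 0.0 if prev_end is None else max(0.0, start - prev_end)
            delays.append((word, round(delay, 2)))
            prev_end = end
        return delays

    print(pre_word_delays([("no", 10.0, 10.3), ("no", 12.8, 13.1)]))
    # [('no', 0.0), ('no', 2.5)] -> the long gap suggests two separate answers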

One other way to monitor CA attention is to present random or periodic indicators into the ASR engine text that the CA has to recognize within the text in some fashion to confirm the CA's attention. For instance, referring again to FIG. 36, in some cases a separate check box may be presented for each ASR transcript line of text as shown at 1610 where a CA has to select each box to place an "X" therein to indicate that the line has been examined. In other cases check boxes may be interspersed throughout the transcript text presented to the CA and the CA may need to select each of those boxes to confirm her attention.

Other AU Device Features and Processes

In at least some of the embodiments described above an AU has the option to request CA assistance or more CA assistance than currently afforded on a call and/or to request ASR engine text as opposed to CA generated text (e.g., typically for privacy purposes). While a request to change caption technique may be received from an AU, in at least some cases the alternative may not be suitable for some reason and, in those cases, the system may forego a switch to a requested technique and provide an indication to the requesting AU that the switch request has been rejected. For instance, if an AU receiving CA generated and corrected text requests a switch to an ASR engine but accuracy of the ASR engine is below some minimal threshold, the system may present a message to the AU that the ASR engine cannot currently support captioning and the CA generation and correction may persist. In this example, once the ASR engine is ready to accurately generate text, the switch thereto may be either automatic or the system may present a query to the AU seeking authorization to switch over to the ASR engine for subsequent captioning.

In a similar fashion, if an AU requests additional CA assistance, a system processor may determine that ASR engine text accuracy is low for some reason that will also affect CA assistance and may notify the AU that a switch will not be made along with a reason (e.g., "Communication line fault").
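
For illustration only, the sketch below shows one way a processor could gate AU switch requests as described in the two preceding paragraphs: an ASR-only request is rejected while recent ASR accuracy is below a minimum threshold, and a request for more CA assistance is rejected with a reason when a line fault would defeat it. The threshold, message text and parameter names are hypothetical.

    def handle_caption_switch_request(requested_mode, asr_recent_accuracy,
                                      line_quality_ok=True, min_accuracy=0.93):
        if requested_mode == "asr_only":
            if asr_recent_accuracy < min_accuracy:
                return (False, "ASR engine cannot currently support captioning; "
                               "CA captioning will continue.")
            return (True, "Switching to ASR captions.")
        if requested_mode == "more_ca_assistance":
            if not line_quality_ok:
                return (False, "Communication line fault")
            return (True, "Adding CA assistance.")
        return (False, "Unknown request.")

    print(handle_caption_switch_request("asr_only", asr_recent_accuracy=0.88))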

In cases where privacy is particularly important to an AU on a specific call or generally, the caption system may automatically, upon request from an AU or per AU preferences stored in a database, initiate all captioning using an ASR engine. Here, where corrections are required, the system may present short portions of an HU's voice signal to a series of CAs so that each CA only considers a portion of the text for correction. Then, the system would stitch all of the CA corrected text together into an HU text stream to be transmitted to the AU device for display.

In some cases it is contemplated that an AU device interface may present a split text screen to an AU so that the AU has the option to view essentially real time ASR generated text or CA corrected text when the corrected text substantially lags the ASR text. To this end, see the exemplary split screen interface 1450 in FIG. 45 where CA corrected text is shown in an upper field 1452 and "real time" ASR engine text is presented in a lower field 1454. As shown, a "CA location" tag 1456 is presented at the end of the CA corrected text while a "Broadcast" tag 1458 is presented at the end of the ASR engine text to indicate the CA and broadcast locations within the text string. Where CA correction latency reaches a threshold level (e.g., the text between the CA correction location and the most recent ASR text no longer fits on the display screen), text in the middle of the string may be replaced by a period indicator 1460 to indicate the duration of HU voice signal at the speaking speed that corresponds to the replaced text. Here, as the CA moves on through the text string, text in the upper field 1452 scrolls up and, as the HU continues to speak, the ASR text in the bottom field 1454 also scrolls up independent of the upper field scrolling rate.

In at least some cases it is contemplated that an HU may use a communication device that can provide video of the HU to an AU during a call. For instance, an HU device may include a portable tablet type computing device or smart phone (see 1219 in FIG. 33) that includes an integrated camera for telepresence type communication. In other cases, as shown in FIG. 33, a camera 1123 may be linked to the HU phone or other communication device 14 for collecting HU video when activated. Where HU video is obtained by an HU device, in most cases the video and voice signals will already be associated for synchronous playback. Here, when the HU voice and video signals are transmitted to an AU device, the HU video may be broken down into video segments that correspond with time stamped text and voice segments and the stamped text, voice and video segments may be stored for simultaneous replay to the AU as well as to a CA if desired. Here, where there are delays between broadcast of consecutive HU voice segments as text transcription progresses, in at least some cases the HU video will freeze during each delay. In other cases the video and audio voice signal may always be synchronized even when text is delayed. If the HU voice signal is sped up during a catch up period as described above, the HU video may be shown at a faster speed so that the voice and video broadcasts are temporally aligned.

FIG. 42 shows an exemplary AU device screen shot 1308 including transcribed text 1382 and a video window or field 1384. Here, assuming that all of the shown text at 1382 has already been broadcast to the AU, if the AU selects the phrase "you should bing the cods along" as indicated by hand icon 1386, the AU device would identify the voice segment and video segment associated with the selected text segment and replay both the voice and video segments while the phrase remains highlighted for the user to consider.

In at least some embodiments where low confidence factors are assigned to captions presented to an AU, low confidence words or phrases may be visually distinguished for the AU so that the AU is at least aware of the fact that the words or phrases may be inaccurate. To this end, see in FIG. 42 that the words "articulate" 1385 and "everywhere" 1387 are shown in a double hatch effect to indicate that those words are visually distinguished for the AU as an indication that each of those words is associated with a low confidence factor and therefore is likely less accurate than other caption text.
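
For illustration only, the following sketch tags low confidence words so an AU device could visually distinguish them as in FIG. 42. The confidence values and the threshold are hypothetical.

    def mark_low_confidence(words_with_confidence, threshold=0.6):
        # words_with_confidence: list of (word, ASR confidence in the range 0..1)
        return [{"word": w, "low_confidence": conf < threshold}
                for w, conf in words_with_confidence]

    caption = [("articulate", 0.41), ("the", 0.98), ("plan", 0.93), ("everywhere", 0.52)]
    for entry in mark_low_confidence(caption):
        print(entry)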

Referring yet again to FIG. 33, in some cases the AU device or AU station may also include a video camera 1125 for collecting AU video that can be presented to the HU during a call. Here, it is contemplated that at least some HUs may be reticent to allow an AU to view HU video without having the reciprocal ability to view the AU during an ongoing call and therefore reciprocal AU viewing would be desirable.

At least four advantages result from systems that present HU video to an AU during an ongoing call. First, where the video quality is relatively high, the AU will be able to see the HU's facial expressions which can increase the richness of the communication experience.

Second, in some cases the HU representation in a video may be useable to discern words intended by an HU even if a final text representation thereof is inaccurate. For instance, where a text transcription error occurs, an AU may be able to select the phrase including the error and view the HU video associated with the selected phrase while listening to the associated voice segment and, based on both the audio and video representations, discern the actual phrase spoken by the HU.

Third, it has been recognized that during most conversations, people instinctively provide visual cues to each other that help participants understand when to speak and when to remain silent while others are speaking. In effect, the visual cues operate to help people take turns during a conversation. By providing video representations to each of an HU and an AU during a call, both participants can have a good sense of when it is their turn to talk, when the other participant is struggling with something that was said, etc. Thus, for instance, in many cases an HU will be able to look at the video to determine if an AU is silently waiting to view delayed text and therefore will not have to ask if there is a delay in AU communication.

Fourth, for deaf AUs that are trained to read lips, the HU video may be useable by the AU to enhance communication.

In at least some cases an AU device may be programmed to query an HU device at the beginning of a communication to determine if the HU device has a video camera useable to generate an HU video signal. If the HU device has a camera, the AU device may cause the HU device to issue a query to the HU requesting access to and use of the HU device camera during the call. For instance, the query may include brief instructions and a touch selectable "Turn on camera" icon or the like for turning on the HU device camera. If the HU rejects the camera query, the system may operate without generating and presenting an HU video as described above. If the HU accepts the request, the HU device camera is turned on to obtain an HU video signal while the HU voice signal is obtained and the video and voice signals are transmitted to the AU device for further processing.

There are video relay systems on the market today where specially trained CAs provide a sign language service for deaf AUs. In these systems, while an HU and an AU are communicating via a communication link or network, an HU voice signal is provided to a CA. The CA listens to the HU voice signal and uses her hands to generate a sequence of signs that correspond at least roughly to the content (e.g., meaning) of the HU voice messages. A video camera at a CA station captures the CA sign sequence (e.g., "the sign signal") and transmits that signal to an AU device which presents the sign signal to the AU via a display screen. If the AU can speak, the AU talks into a microphone and the AU's voice is transmitted to the HU device where it is broadcast for the HU to hear.

In at least some cases it is contemplated that a second or even a third communication signal may be generated for the HU voice signal that can be transmitted to the AU device and presented along with the sign signal to provide additional benefit to the AU. For instance, it has been recognized that, while sign language can come close to the meaning expressed in an HU voice signal, in many cases there is no exact translation of a voice message to a sign sequence and therefore some meaning can get lost in the voice to sign signal translation. In these cases, it would be advantageous to present both a text translation and a sign translation to an AU.

In at least some cases it is contemplated that an ASR engine at a relay or operated by a fourth party server linked to a relay may, in parallel with a CA generating a sign signal, generate a text sequence for an HU voice signal. The ASR text signal may be transmitted to an AU device along with or in parallel with the sign signal and may be presented simultaneously as the text and sign signals are generated. In this way, if an AU questions the meaning of a sign signal, the AU can refer to the ASR generated text to confirm meaning or, in many cases, review an actual transcript of the HU voice signal as opposed to a sometimes less accurate sign language representation.

In many cases an ASR will be able to generate text far faster than a CA will be able to generate a sign signal and therefore, in at least some cases, ASR engine text may be presented to an AU well before a CA generated sign signal. In some cases where an AU views, reads and understands text segments well prior to generation and presentation of a sign signal related thereto, the AU may opt to skip ahead and forego sign language for the intervening HU voice signal. Where an AU skips ahead in this fashion, the CA would be skipped ahead within the HU voice signal as well and continue signing from the skipped-to point on.

In at least some cases it is contemplated that a relay or other system processor may be programmed to compare text signal and sign signal content (e.g., actual meaning ascribed to the signals) so that time stamps can be applied to text and sign segment pairings, thus enabling an AU to skip back through communications to review a sign signal simultaneously with a paired text tag or other indicator. For instance, in at least some embodiments, as HU voice is converted by a CA to sign segments, a processor may be programmed to assess the content (e.g., meaning) of each sign segment. Similarly, the processor may also be programmed to analyze the ASR generated text for content and to then compare the sign segment content to the text segment content to identify matching content. Where sign and text segment content match, the processor may assign a time stamp to the content matching segments and store the stamp and segment pair for subsequent access. Here, if an AU selects a text segment from her AU device display, instead of (or in addition to, in some embodiments) presenting an associated HU voice segment, the AU device may represent the sign segment paired with the selected text.

Referring again to FIG. 33, the exemplary CA station includes, among other components, a video camera 55 for taking video of a signing CA to be delivered along with transcribed text to an AU. Referring also and again to FIG. 42, a CA signing video window is shown at 1390 alongside a text field that includes text corresponding to an HU voice signal. In FIG. 42, if an AU selects the phrase labelled 1386, that phrase would be visually highlighted or distinguished in some fashion and the associated or paired sign signal segment would be represented in window 1390.

In at least some video relay systems, in addition to presenting sign and text representations of an HU voice signal, an HU video signal may also be used to represent the HU during a call. In this regard, see again FIG. 42 where both an HU video window 1384 and a CA signing window 1390 are presented simultaneously. Here, all communication representations 1382, 1384 and 1390 may always be synchronized via time stamps in some cases while in other cases the representations may not be completely synchronized. For instance, in some cases the HU video window 1384 may always present a real time representation of the HU while the text and sign signals 1382 and 1390 are synchronized and typically delayed at least somewhat to compensate for the time required to generate the sign signal as well as AU replay of prior sign signal segments.

In still other embodiments it is contemplated that a relay or other system processor may be programmed to analyze sign signal segments generated by a signing CA to automatically generate text segments that correspond thereto. Here the text is generated from the sign signal as opposed to directly from the voice signal and therefore would match the sign signal content more closely in at least some embodiments. Because the text is generated directly from the sign signal, time stamps applied to the sign signal can easily be aligned with the text signal and there would be no need for content analysis to align signals. Instead of using content to align, a sign signal segment would be identified and a time stamp applied thereto, then the sign signal segment would be translated to text and the resulting text would be stored in the system database correlated to the corresponding sign signal segment and the time stamp for subsequent access.

FIG. 44 shows yet another exemplary AU screen shot 1400 where text segments are shown at 1402 and an HU video window is shown at 1412. The text 1402 includes a block of text where the block is presented in three visually distinguished ways. First, a currently audibly broadcast word is highlighted or visually distinguished in a first way as indicated at 1406. Second, the line of text that includes the word currently being broadcast is visually distinguished in a second way as shown at 1404. Other text lines are presented above and below the line 1404 to show preceding text and following text for context. In addition, the line at 1404 including the currently broadcast word at 1406 is presented in a larger format to call an AU's attention to that line of text and the word being broadcast. The larger text makes it easier for an AU to see the presented text. Moreover, the text block 1402 is controlled to scroll upward while keeping the text line that includes the currently broadcast word generally centrally vertically located on the AU device display so that the AU can simply train her eyes on the central portion of the display with the transcribed words scrolling through the field 1404. In this case, a properly trained AU would know that prior broadcast words can be rebroadcast by tapping a word above field 1404 and that the broadcast can be skipped ahead by tapping one of the words below field 1404. Video window 1412 is provided spatially close to field 1404 so that the text presented therein is intuitively associated with the HU video in window 1412.

Referring still to FIG. 44, in at least some embodiments a captioned device camera 2200A may be arranged along a side edge of the device display screen and an HU telepresence type field 1412 may be positioned along that edge in a location selected to enhance the sense of eye contact when an AU looks toward the field 1412 at the presented HU video. In addition, the stationary text line 1404 may be presented on the display adjacent the telepresence field 1412 and at a vertical height that is similar to the height of the camera lens 2200A. In this way, regardless of whether the AU is looking at the HU representation in field 1412 or the text in field 1404, the HU will have a sense that the AU is generally looking in the direction of the HU and at least at times making eye contact with the HU.

In still other embodiments it is contemplated that an AU captioned device may include two or more differently located cameras (see 2200, 2200A, 2200B, 2200C in FIG. 44) and the AU may have the option to arrange content windows in any of several different ways. For instance, the AU may be able to remove telepresence field 1412 entirely from the screen, select and drag the telepresence field to different locations on the screen and change the size of the telepresence field. Where the user moves the telepresence field to a location adjacent a different one of the device cameras, the device processor may automatically select the proximate camera for generating AU images/video to send to the HU during a call so that the illusion of eye contact is maintained and optimized to the extent possible. Similarly, the AU may be able to drag the caption field around to different screen locations and change the field size, font, text size, etc., to meet user preferences. Where one of the fields is moved, other fields may automatically resize and move to accommodate new placement of the moved field.

In at least some embodiments where HU voice signal is broadcast essentially immediately or with minimal delay once received at the AU device, the highlighted line 1404 may always be the most recent line of text captioned either via a CA or an ASR, regardless of what caption words are currently being considered by a CA for error correction.

In at least some embodiments it is contemplated that when a CA replaces an ASR engine to generate text for some reason, where the CA revoices an HU voice signal to an ASR engine to generate the text, instead of providing the voice signal re-voiced by the CA to an ASR engine at the relay, the CA revoicing signal may be routed to the ASR engine that was previously being used to convert the HU voice signal to text. Thus, for instance, where a system was transmitting an HU voice signal to a fourth party ASR engine provider when a CA takes over text generation via re-voicing, when the CA voices a word, the CA voice signal may be transmitted to the fourth party provider to generate transcribed text which is then transmitted back to the relay and on to the AU device for presentation.

In at least some cases it is contemplated that a system processor may treat at least some CA inputs into the system differently as a function of how well the ASR is likely performing. For instance, as described above, in at least some cases when a CA selects a word in a text transcript on her display screen for error correction, in normal operation, the selected word is highlighted for error correction. Here, however, in some cases what happens when a CA selects a text transcript word may be tied to the level of perceived or likely errors in the phrase that includes the selected word. Where a processor determines that the number of likely errors in the phrase is small, the system may operate in the normal fashion so that only the selected word or sub-phrase (e.g., after word selection and a swiping action) is highlighted and prepared for replacement or correction. Where the processor determines that the number of likely errors in the phrase is large (e.g., the phrase is predictably error prone), the system may operate to highlight the entire error prone phrase for error correction so that the CA does not have to perform other gestures to select the entire phrase. Here, when an entire phrase is visually distinguished to indicate the ability to correct, the CA microphone may be automatically unmuted so the CA can revoice the HU voice signal to rapidly generate corrected text.

In other cases, while a simple CA word selection may cause that word to be highlighted, some other more complex gesture after word selection may cause the phrase including the word to be highlighted for editing. For instance, a second tap on a word immediately following the word selection may cause a processor to highlight the entire word-containing phrase for editing. Other gestures for phrase, sentence, paragraph, etc., selection are contemplated.

In at least some embodiments it is contemplated that a system processor may be programmed to adjust various CA station operating parameters as a function of a CA's stored profile as well as real time scoring of CA captioning. For instance, CA scoring may lead to a CA profile that indicates a preferred or optimal rate of HU voice signal broadcast (e.g., in words per minute) for a specific CA. Here, the system may automatically use the optimal broadcast rate for the specific CA. As another instance, a processor may monitor the rate of CA captioning, CA correcting and CA error rates and may adjust the rate of HU voice signal broadcast to a rate that results in optimal time and error rate statistics. Here, the rate may be increased during a beginning portion of a CA's captioning shift until optimal statistics result. Then, if statistics fall off at any time, the system may slow the HU voice signal broadcast rate to maintain errors within an acceptable range.

In some cases a CA profile may specify separate optimal system settings for each of several different HU voice signal types or signal characteristic subsets. For instance, for a first CA, a first HU voice signal broadcast rate may be used for a Hispanic HU voice signal while a second, relatively slower HU voice signal broadcast rate may be used for a Caucasian HU voice signal. Many other HU voice signal characteristic subsets and associated optimal station operating characteristics are contemplated.
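
For illustration only, the sketch below shows one way a station could apply the profile-driven settings just described: it looks up a preferred HU voice broadcast rate for the CA and the detected voice signal type and then nudges the rate to keep error statistics in an acceptable range. The profile contents, rates and thresholds are hypothetical.

    DEFAULT_WPM = 150

    def select_broadcast_rate(ca_profile, voice_type):
        # ca_profile maps HU voice signal types to preferred broadcast rates (WPM)
        return ca_profile.get(voice_type, ca_profile.get("default", DEFAULT_WPM))

    def adjust_for_statistics(current_wpm, error_rate, max_error_rate=0.04,
                              step_wpm=10, min_wpm=100, max_wpm=220):
        if error_rate > max_error_rate:
            return max(min_wpm, current_wpm - step_wpm)  # slow down to protect accuracy
        return min(max_wpm, current_wpm + step_wpm)      # speed up while statistics hold

    profile = {"hispanic": 180, "default": 150}
    rate = select_broadcast_rate(profile, "hispanic")
    print(adjust_for_statistics(rate, error_rate=0.06))  # 170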

ASR-CA Backed Up Mode

While several different types of semi-automated systems have been described above, one particularly advantageous system includes an automatic speech recognition system that at least initially handles incoming HU voice signal captioning where the ASR generated text is corrected by a CA and where the CA has the ability to manually (e.g., via selection of a button or the like) take over captioning whenever deemed necessary. Hereinafter, unless indicated otherwise, this type of ASR text first and CA correction second system will be referred to as an ASR-CA backed up mode. Advantages of an ASR-CA backed up mode include the following. First, initial caption delay is minimized and remains relatively consistent so that captions can be presented to an AU as quickly as possible. To this end, ASR engines generate initial captions relatively quickly when compared to CA generated text in most cases in steady state.

Second, caption errors associated with current ASR engines can be essentially eliminated by a CA that only corrects ASR errors in most cases and final corrected text can be presented to an AU rapidly.

Third, by combining rapid ASR text with the error correction skills of a CA, it is possible to mix those capabilities in different ways to provide optimal captioning speed and accuracy regardless of the characteristics of different calls that are fielded by the captioning system.

Fourth, the combination of rapid ASR text and CA error correction enables a system where an AU can customize their captioning system in many different ways to suit their own needs and system expectations to enhance their communication capabilities.

While various aspects of an ASR-CA backed up mode have been described above, some of those aspects are described in greater detail and additional aspects are described hereafter.

While an ASR engine is typically much faster at generating initial caption text than a CA, in at least some specific cases a CA may in fact be faster than an ASR engine. Whether or not CA captioning is likely to be faster than ASR captioning is often a function of several factors including, for instance, a CA's particular captioning strengths and weaknesses as well as characteristics of an HU voice signal that is to be captioned. For instance, a specific first CA may typically rapidly caption Hispanic voice signals but may only caption Midwestern voice signals relatively slowly so that when captioning a Hispanic signal the CA speed can exceed the ASR speed while the CA typically cannot exceed the ASR speed when captioning a Midwestern voice signal. As another instance, while an ASR may caption a high quality HU voice signal faster than the first CA, the first CA may caption a low quality HU voice signal faster than the ASR.

As described above, in some cases the system may present an option (see caption source switch button 751 in FIG. 23A) for a CA to change from the ASR generating original text and the CA correcting that text to a system where the CA generates original text and corrects errors. In other cases a system processor may automatically change the system over to CA original and corrected text when the ASR is too slow, is generating too many meaningful (e.g., "visible", changing the meaning of a phrase) transcription errors, or any combination of both. In still other cases a system processor that determines that a specific CA, based on CA strengths and HU voice signal characteristics, would likely be able to generate initial text faster than the ASR may be programmed to offer a suggestion to the CA to switch over.

Thus, in some cases the caption source switch button 751 in FIG. 23A may only be presented to a CA as an option when a system processor determines that the specific CA should be able to generate faster initial captions for an HU voice signal. In an alternative, button 751 may always be presented to a CA but may have two different appearances including the full button for selection and a greyed out appearance to indicate that the button is not selectable. Here, by presenting the greyed out button when not selectable, a user will not be confused when that button is absent.

In some cases it may be required that a CA is likely to be able to speed up transcription appreciably before button 751 is presented, so that small possible increases in speed do not cause a suggestion to be presented to the CA which could simply distract the CA from error correction. For instance, in an exemplary case, a processor may have to calculate that it is likely a specific CA can speed up transcription by 15% or more in order to present button 751 to the CA for selection.

In some cases the system processor may take into account more than initial captioning speed when determining when to present caption source switch button 751 to a CA. For instance, in some cases the processor may account for some combination of speed and some factor related to the number of transcription errors generated by an ASR to determine when to present button 751. Here, how speed and accuracy factors are weighed to determine when button 751 should be presented to a user may be a matter of designer choice and should be set to create a best possible AU experience.
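
For illustration only, the following sketch combines a speed factor and an error factor to decide whether caption source switch button 751 should be offered, along the lines described in the preceding paragraphs. The 15% speed-up figure comes from the example above; the error limit, the input estimates and the function name are hypothetical.

    def should_offer_switch_button(predicted_ca_wpm, recent_asr_wpm,
                                   asr_visible_errors_per_100_words,
                                   min_speedup=0.15, max_visible_errors=2.0):
        speedup = (predicted_ca_wpm - recent_asr_wpm) / max(1.0, recent_asr_wpm)
        too_many_errors = asr_visible_errors_per_100_words > max_visible_errors
        return speedup >= min_speedup or too_many_errors

    # Offered: the CA is predicted to be about 20% faster than the ASR on this call.
    print(should_offer_switch_button(132, 110, asr_visible_errors_per_100_words=1.2))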

In at least some cases it is contemplated that when the system automatically switches to full CA captioning and correction or the CA selects button 751 to switch to full CA captioning and correction, the ASR may still operate in parallel with the CA to generate a second initial version (e.g., a version in addition to the CA generated captions) of the HU voice signal and the system may transmit whichever captions are generated first (e.g., ASR or CA) to the AU device for presentation. Here, it has been recognized that even when a CA takes over full captioning and correction, which captioning is fastest, ASR or CA, may switch back and forth and, in that case, the fastest captions should always be provided to the AU.

As recognized above, in at least some cases third party (e.g., a server in the cloud) ASR engines have at least a couple of shortcomings. First, third party ASR engine accuracy tends to decrease at the end of relatively long voice signal segments to be transcribed.

Second, ASR engines use context to generate final transcription results and therefore are less accurate when input voice segments are short. To this end, initial ASR results for a word in a voice signal are typically based on phonetics and then, once initial results for several consecutive words in a signal are available, the ASR engine uses the context of the words together as well as additional characteristics of the voice of the speaker generating the voice signal to identify a best final transcription result for each word. Where a voice segment in an ASR request is short, the signal includes less context in the segment for accurately identifying a final result and therefore the results tend to be less accurate.

Third, final results tend to be generated in clumps, which means that automated ASR error corrections presented to a CA or an AU tend to be presented in spurts which can be distracting. For instance, if five consecutive words are changed in text presented on an AU's device display at the same time, the changes can be distracting.

As described above, one solution to the third party ASR shortcomings is to divide an HU voice signal into signal slices that overlap to avoid inaccuracies related to long duration signal segments. In addition, to make sure that all final transcription results are contextually informed, each segment slice should be at least some minimum segment length to ensure sufficient context. Ideally, segment slices sent to the ASR engine as transcription requests would include a predefined number of words within a range (e.g., 3 to 15 words) where the range is selected to ensure at least some level of context to inform the final result. Unfortunately, an HU voice signal is not transcribed prior to sending it to the ASR engine and therefore there is no way to ascertain the number of words in a voice segment prior to receiving transcription results back from the ASR.

For this reason segment slices have to be time based as opposed to word count based where the time range of each segment is selected so that it is likely the segment includes an optimal number (e.g., 3 to 15 words) of words spoken by an HU. In at least some cases the time range will be between 1 and 10 seconds and, in particularly advantageous cases, the range is between 1 and 3 seconds.

Once initial and/or final transcription results are received back at a relay for one or more HU voice signal segments, a relay processor may count the number of words in the transcription and automatically adjust the duration of each HU voice signal segment up or down to adapt to the HU's rate of speech so that each subsequent segment slice has the greatest chance of including an optimal number of words. Thus, for instance, where an HU talks extremely quickly, an initial segment slice duration of four seconds may be shortened to a two second duration.
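
For illustration only, the sketch below adapts the slice duration to the HU's speaking rate as described above: given the number of words returned for the last slice, it picks a duration likely to capture a target word count within the allowed time range. The target and limits are hypothetical but track the 3 to 15 word and 1 to 10 second ranges discussed above.

    def next_slice_duration(last_duration_s, words_returned,
                            target_words=9, min_s=1.0, max_s=10.0):
        if words_returned <= 0:
            return last_duration_s               # silence; keep the current duration
        words_per_second = words_returned / last_duration_s
        duration = target_words / words_per_second
        return max(min_s, min(max_s, duration))

    # A fast talker returned 18 words from a 4 second slice, so shorten the slice.
    print(next_slice_duration(4.0, 18))  # 2.0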

In at least some cases a relay may only use central portions of ASR transcribed HU voice signal slices for final transcription results to ensure that all final transcribed words are contextually informed. Thus, for instance, where a typical voice signal slice includes 12 words, the relay processor may only use the third through ninth words in an associated transcription to correct the initial transcription so that all of the words used in the final results are context informed.

As indicated above, consecutive HU voice segment slices sent to ASR engines may be overlapped to ensure no word is missed. Overlapping segments also have the advantage that more context can be presented for each final transcription word. At the extreme, the relay may transmit a separate ASR transcription request for each sub-period that is likely to be associated with a word (e.g., based on HU speaking rate or average HU speaking rate) and only one or a small number of transcribed words in a returned text segment may be used as the final transcription result. For instance, where overlapping segments each return an average of seven final transcribed words, the relay may only use the middle three of those words to correct initial text presented to the CA and the AU.

Where ASR transcription requests include overlapping HU voice signal segments, consecutive requests will return duplicative transcriptions of the same words. In at least some cases the relay processor receiving overlapping text transcriptions will identify duplicative word transcriptions and eliminate duplication in initial text presented to the CA and the AU as well as in final results.
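
For illustration only, the following sketch eliminates duplicative words returned by overlapping slices by accepting a word only if no word has already been accepted at (nearly) the same position in the HU voice signal. The time-keyed data shape and the tolerance value are hypothetical.

    def merge_overlapping_results(accepted, new_words, tolerance_s=0.25):
        # accepted: list of (time_s, word) already presented to the CA and AU
        # new_words: list of (time_s, word) returned for the latest overlapping slice
        known_times = [t for t, _ in accepted]
        for t, word in new_words:
            if any(abs(t - kt) <= tolerance_s for kt in known_times):
                continue  # duplicative transcription of an already presented word
            accepted.append((t, word))
            known_times.append(t)
        return sorted(accepted)

    stream = [(0.0, "we"), (0.4, "may"), (0.8, "go")]
    print(merge_overlapping_results(stream, [(0.8, "go"), (1.2, "out")]))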

In at least some cases it is contemplated that overlapping ASR requests may correspond to different length HU voice signal segments where some of the segment lengths are chosen to ensure rapid (e.g., essentially immediate) captions and rapid intermediate correction results while other lengths are chosen to optimize for context informed accuracy in final results. To this end, a first set of ASR requests may include short HU voice signal slices to expedite captioning and intermediate correction speed albeit while sacrificing some accuracy, and a second set of ASR requests may be relatively longer so that context informed final text is optimally identified.

Referring to FIG. 46, a schematic is shown that includes a single HU voice signal line of text where the text is divided into signal segments or slices including first through sixth short slices and first, second and third long slices. The first long slice includes voice signal associated with the first through third short slices. The first long slice includes many words usable for immediate initial transcription as well as for final contextual transcription correction. Each long slice word is transmitted to an ASR engine essentially immediately as the HU voices the segment (e.g., a link to the ASR is opened at the beginning of the long slice and remains open as the HU voices the slice). Initial transcription of each word in the first long slice is almost immediate and is fed immediately to the CA for manual correction and to the AU as an initial text transcription irrespective of transcription errors that may exist. As more first slice words are voiced and transmitted to the ASR engine, those words are immediately transcribed and presented to the CA and AU and are also used to provide context for previously transcribed words in the first long slice so that errors in the prior words can be corrected.

Referring still to FIG. 46, the second long slice overlaps the first long slice and includes a plurality of words that correspond to a second slice duration. To handle the second long slice transcription, a second ASR request is transmitted to an ASR engine as the HU voices each word in the second slice and substantially real time or immediate text is transmitted back from the engine for each received word. In addition, as the second slice words are transcribed, those words are also used by the ASR engine to contextually correct prior transcribed words in the second slice to eliminate any perceived errors and those corrections are used to correct text presented to the CA and the AU.

The third long slice overlaps the second long slice and includes a plurality of words that correspond to a third slice duration. To handle the third long slice transcription, a third ASR request is transmitted to an ASR engine as the HU voices each word in the third slice and substantially real time or immediate text is transmitted back from the engine for each received word. In addition, as the third slice words are transcribed, those words are also used by the ASR engine to contextually correct prior transcribed words in the third slice to eliminate any perceived errors and those corrections are used to correct text presented to the CA and the AU.

It should be apparent from FIG. 46 that, because long slices overlap, two (and in some cases more) transcriptions for many HU voice signal words will be received by a relay from one or more ASR engines and therefore a relay processor has to be programmed to select which of the two or more initial transcriptions for a word to present to a CA and an AU and which of two or more final transcriptions for the word to use to correct text already presented to the CA and AU. In at least some embodiments the relay processor may be programmed to select the first long slice in an HU voice signal for generating initial transcription text for all first long slice words, the second long slice in the voice signal for generating initial transcription text for all second long slice words that follow the end time of the first long slice and the third long slice in the voice signal for generating initial transcription text for all third long slice words that follow the end time of the second long slice.

In an alternative system, the relay processor may be programmed to select the first long slice in an HU voice signal for generating initial transcription text for all first long slice words prior to the start time of the second long slice, the second long slice in the voice signal for generating initial transcription text for all second long slice words prior to the start time of the third long slice and the third long slice in the voice signal for generating initial transcription text for all third long slice words.

In yet one other alternative system, for words that are included in overlapping signal slices, the relay processor may pass on the first transcription of any word that is received from any ASR engine to the CA and AU devices to be presented irrespective of which slice included the word. Here, a second or other subsequent initial transcription of an already presented word may be completely ignored or may be used to correct the already presented word in some cases.
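
For illustration only, the sketch below implements the last of the three selection rules described above: for word positions covered by overlapping long slices, the first initial transcription received is presented, and later duplicates are either ignored or optionally applied as corrections. The class name and the time-keyed word positions are hypothetical.

    class InitialTextSelector:
        def __init__(self):
            self.presented = {}  # approximate word time -> word already sent to CA/AU

        def on_asr_result(self, word_time_s, word, correct_duplicates=False):
            key = round(word_time_s, 1)
            if key not in self.presented:
                self.presented[key] = word
                return ("present", word)       # first transcription wins
            if correct_duplicates and self.presented[key] != word:
                self.presented[key] = word
                return ("correct", word)       # later slice used as a correction
            return ("ignore", None)

    selector = InitialTextSelector()
    print(selector.on_asr_result(3.2, "bring"))   # ('present', 'bring')
    print(selector.on_asr_result(3.2, "bring"))   # ('ignore', None)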

Referring again to FIG. 46, regarding final ASR text results for error correction, the first long slice transcription includes more contextual content than the second long slice for about the first two thirds of the first slice voice signal, the second long slice transcription includes more contextual content than the first and third long slices for about the central half of the second slice voice signal and so on. Thus, to provide the most accurate ASR transcription error correction, the relay processor may be programmed to use final ASR text from sub-portions of each long signal slice for error correction including final ASR text from about the first two thirds of the first long slice, about the central half of the second long slice and about the last two thirds of the third long slice. Here, because the slices are time based as opposed to word based, the exact sub-portion of each overlapping slice used for final text results can only be approximate until the text results are received back from the ASR engines.

Thus, it should be appreciated that different overlapping voice segments or slices may be used to generate initial and final transcriptions of words in at least some embodiments where the segments are selected to optimize for different purposes (e.g., speed or contextual accuracy).

Referring still to FIG. 46, while shown as consecutive and distinct, consecutive short slices may overlap at least somewhat as described above. Each short slice has a relatively short duration (e.g., 1-3 seconds) and is transmitted to an ASR engine as the HU voices the segment (e.g., a link to the ASR is opened at the beginning of the slice and remains open as the HU voices the slice). Here, initial transcription of each word in a short segment is almost immediate and could in some cases be used to provide the initial transcription of words to a CA and an AU in at least some embodiments. The advantage of shorter voice signal slices in ASR transcription requests is that the ASR should be able to generate more rapid final text transcriptions for words in the shorter segments so that error corrections in text presented to the CA and the AU are completed more rapidly. Thus, for instance, while an ASR may not finalize correction of text at the beginning of the first long slice in FIG. 46 until just after that slice ends so that all of the contextual information in that slice is considered, a different ASR handling the first short slice would complete its contextual error correction just after the end time of the first short slice. Here, because short slice final text is generated relatively rapidly and only affects a small text segment, it can be used to reduce the amount of sporadic large magnitude error corrections that can be distracting to a CA or an AU. In other words, short slice final text error correction is more regular and generally of smaller magnitude than long slice final text error correction.

As explained above, one problem with short voice signal slices is that there is not enough content (e.g., additional surrounding words) in a short slice to result in highly accurate final text. Nevertheless, even short slice context results in better accuracy than initial transcription in most cases and can operate as an intermediate text correction agent to be followed up by long slice final text error correction. To this end, referring yet again to FIG. 46, in at least some embodiments the long text segments may be used to generate initial transcribed text presented to a CA and an AU. Intermediate error corrections in the initial text may be generated via contextual processing of the short signal segments and used immediately as an intermediate error correction for the initial text presented to the CA and AU. Final error correction in the intermediately corrected text may be generated via contextual processing of the long signal segments and used to finally error correct the intermediately corrected text for both the CA and the AU.

While initial, intermediate and final ASR text may be presented to each of the CA and an AU in some cases, in other embodiments the intermediate text may only be presented to one or the other of the CA and the AU. For instance, where initial text results may be displayed for each of the CA and the AU, intermediate results related to contextual processing of short voice signal slices may be used to correct errors in line in the CA presented text only, to minimize distractions on the AU's display screen.

While the signal slicing and initial and final text selection processes have been described above as being performed by a relay processor, in other embodiments where an AU device or even an HU device links to an ASR engine to provide an HU voice signal thereto and receive text therefrom, the AU or HU device would be programmed to slice the voice signal for transmission in a similar fashion and to select initial and final and in some cases intermediate text to be presented to system users in a fashion similar to that described above.

While ASR engines operate well under certain circumstances, they are simply less effective than pure CA transcription systems under other sets of circumstances. For instance, it has been observed that during a first short time just after an AU-HU call commences and a second short time at the end of the call, when accurate content is particularly time sensitive as well as often unclear and rushed, full CA modes have a clear advantage over ASR-CA backed up modes. For this reason, in at least some embodiments it is contemplated that one type of system may initially link the HU portion of a call to a full CA mode where a CA transcribes text and corrects that text for at least the beginning portion of the call after which the call is converted to an ASR-CA backed up call where an ASR engine generates initial text and ASR corrections with a CA further correcting the initial and final ASR text. For instance, in some cases the HU voice signal during the first 10-15 seconds of an AU-HU call may be handled by the full CA mode and thereafter the ASR-CA backed up mode may kick in once the ASR has context for subsequent words and phrases to increase overall ASR accuracy.

In some cases only a small subset of highly trained CAs may handle the full CA mode duties and, when the ASR-CA backed up mode kicks in, the call may be transferred to a second CA that operates as a correction only CA most of the time. In other cases a single CA may operate in the full CA mode as well as in the ASR-CA backed up mode to maintain captioning service flow.

It has been recognized that for many AUs that have at least partial hearing capabilities, in most cases during an AU-HU call by far the most important caption text is the text associated with the most recently generated HU voice signal. To this end, in many cases an AU that has at least partial hearing relies on her hearing as opposed to caption text to understand HU communications. Then, when an AU periodically misunderstands an HU voiced word or phrase, the AU will turn to displayed captions to clarify the HU communication. Here, most AUs want immediate correct text in real time, as opposed to three or six or more seconds later after a CA corrects the text, so that the corrections are as close to simultaneous with the real time HU voice signal broadcast as possible. To be clear, in these cases, correct text corresponding to the most recent seven or fewer seconds of HU voice signal is far more important most of the time than correct text associated with HU voice signal from 20 seconds ago.

In these cases and others where accurate, substantially real time text is particularly important, a captioning system processor may be programmed to enforce a maximum cumulative duration of HU voice signal broadcast pause seconds to ensure that all CA correction efforts are at least somewhat aligned with the HU's real time voice signal. For instance, in some cases the maximum cumulative pause duration may be limited to seven seconds or five seconds or even three seconds to ensure that essentially real time corrections to AU captions occur. In other cases the maximum cumulative delay may be limited by a maximum number of ASR text words so that, for instance, a CA cannot get more than 3 or 5 or 7 words behind the initially generated ASR text.
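By way of a minimal illustrative sketch only, one way such limits might be enforced is shown below; the class name, method names and threshold values are hypothetical assumptions chosen for illustration rather than elements of any figure.

# Hypothetical sketch: cap the cumulative HU voice broadcast pause time and the
# number of ASR words a CA may fall behind. Names and thresholds are illustrative.

MAX_PAUSE_SECONDS = 5.0   # e.g., three, five or seven seconds as described above
MAX_WORD_LAG = 5          # e.g., 3, 5 or 7 ASR generated words

class BroadcastGovernor:
    def __init__(self):
        self.cumulative_pause = 0.0   # total seconds the HU broadcast has been held
        self.word_lag = 0             # ASR words generated but not yet reached by the CA

    def on_pause(self, seconds):
        self.cumulative_pause += seconds

    def on_asr_word(self):
        self.word_lag += 1

    def on_ca_word_reached(self):
        self.word_lag = max(0, self.word_lag - 1)

    def must_resume_broadcast(self):
        # True when either cap is reached, forcing the HU voice broadcast (and the
        # CA's correction focus) back toward real time.
        return (self.cumulative_pause >= MAX_PAUSE_SECONDS
                or self.word_lag >= MAX_WORD_LAG)

governor = BroadcastGovernor()
governor.on_asr_word()
governor.on_pause(2.0)
print(governor.must_resume_broadcast())   # False until a cap is reached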

Referring now to FIG. 52, an exemplary CA display screen shot 1650 is illustrated that presents ASR text to a CA as the CA listens to a hearing user's voice signal via headset 54 as indicated at 1654. In this case, the CA is restricted to editing only text that appears in the most recent two lines 1662 and 1664 of the presented text, which is visually distinguished by an offsetting box labelled 1656. Box 1656 stays stationary as additional ASR generated text is generated and added to the bottom of the text block 1652 and the on screen text scrolls upward. Again, as in several other figures described above, a system processor highlights or otherwise visually distinguishes the text word that corresponds to the instantaneously broadcast HU voice signal word as shown at 1660. Here, however, when the text 1652 scrolls up one line, if the word being broadcast is in the top line 1662 in box 1656 when scrolling occurs, the broadcast to the CA skips to the first word in the next line 1664 when a new line of text is added therebelow. To this end, see FIG. 53 where one line of scrolling occurred while the system was still broadcasting a word in line 1662 in FIG. 52 so that the highlighted and broadcast word is skipped ahead to the word “want” at the beginning of line 1664.

In some cases a limitation on CA corrections may be based on the maximum amount of text that can be presented on the CA display screen. For instance, in a case where only approximately 100 ASR generated words can appear on an AU's display screen, it would make little sense to allow a CA to correct errors in ASR text prior to the most recent 100 words because it is highly likely that earlier corrections would not be visible to the AU. Thus, for instance, in some cases a cumulative maximum seconds delay may be set to 20 seconds where text associated with times prior to the 20 second threshold simply cannot be corrected by the CA. In other cases the cumulative maximum delay may be word count based (e.g., the maximum delay may be no more than 30 ASR generated words). In other cases the maximum delay may vary with other sensed parameters such as line signal quality, the HU's speaking rate (e.g., words per minute, actual or average), a CA's current or average captioning statistics, etc.

A CA's ability to correct text errors may be limited in several different ways. For instance, relatively aged text that a CA can no longer correct may be visually distinguished (e.g., highlighted, scrolled up into a “firm” field, etc.) in a fashion different from text that the CA can still correct. As another instance, text that cannot be corrected may simply be scrolled off or otherwise removed from the CA display screen.

Where a CA is limited to a maximum number of cumulative delay seconds, the cumulative delay count may be reduced by any perceived HU silent periods that occur between a current time and a time that precedes the current time by the instantaneous delay count. Thus, for instance, if a current delay second count is 18 seconds and the most recent 18 seconds includes a 12 second HU silent period (e.g., during an AU talking turn), then the cumulative delay may be adjusted downward to 6 seconds as the system will be able to remove the 12 second silent period from CA consideration so that the CA can catch up more rapidly.
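A minimal sketch of this adjustment follows, assuming the system reports HU silence intervals within the lagging window; the function name and interval representation are hypothetical.

def effective_delay(raw_delay_seconds, silent_intervals):
    # raw_delay_seconds: how far the CA lags real time (e.g., 18 seconds).
    # silent_intervals: (start, end) offsets in seconds, measured back from the
    # current time, of HU silence that falls within the lagging window.
    silent_total = sum(end - start for start, end in silent_intervals
                       if end <= raw_delay_seconds)
    return max(0.0, raw_delay_seconds - silent_total)

# Example from the text: an 18 second delay containing a 12 second silent period
# yields an effective delay of 6 seconds.
print(effective_delay(18.0, [(2.0, 14.0)]))   # 6.0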

In at least some cases it has been recognized that signal noise can appear on a communication link where the noise has a volume and perhaps other detected characteristics but that cannot be identified by an ASR engine as articulated words. Most of the time in these cases the noise is just that, simply noise. In some cases where line signal can clearly be identified as noise, a period associated with the noise may be automatically eliminated from the HU voice signal broadcast to a CA for consideration so that those noisy periods do not slow down CA captioning of actual HU voice signal words. In other cases where an ASR cannot identify words in a received line signal but cannot rule out the line signal as noise, a relay processor may broadcast that signal to a CA at a high rate (e.g., 2 to 4 times the rate of HU speech) so that the possible noisy period is compressed. In most cases where the line signal is actually noise, the CA can simply listen to the expedited signal, recognize the signal as noise, and ignore the signal. In other cases the CA can transcribe any perceived words or may slow down the signal to a normal HU speech rate to better comprehend any spoken words. Here, once the ASR recognizes a word in the HU voice signal and generates a captioned word again, the pace of HU voice signal broadcast can be slowed to the HU's speech rate.
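As a minimal sketch, and assuming hypothetical segment flags and an illustrative 3x compression rate, the broadcast rate decision might look like the following:

# Hypothetical sketch: pick a CA broadcast rate for each audio segment based on
# whether the ASR found words in it and whether it was positively identified as noise.

def broadcast_rate(asr_found_words, clearly_noise):
    if clearly_noise:
        return None    # segment is not broadcast to the CA at all
    if not asr_found_words:
        return 3.0     # possible noise: compress playback (2x to 4x)
    return 1.0         # recognized speech: broadcast at the HU's speaking rate

segments = [
    {"asr_found_words": True,  "clearly_noise": False},
    {"asr_found_words": False, "clearly_noise": False},
    {"asr_found_words": False, "clearly_noise": True},
]
print([broadcast_rate(s["asr_found_words"], s["clearly_noise"]) for s in segments])
# [1.0, 3.0, None]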

In cases where a CA switches from an ASR-CA backed up mode to a full CA mode, in at least some embodiments, the non-firm ASR generated text is erased from the CA's display screen to avoid CA confusion. Thus, for instance, referring again to FIG. 23A, if a CA selects the full CA captioning/correction button 751 to initiate a pure CA text transcription and correction process, the CA display screen shot may be switched to the shot illustrated in FIG. 47. As shown in FIG. 47, firm ASR text prior to the current word considered by the CA at 781 or corrected by the CA persists at 783, but ASR generated text thereafter is wiped from the display screen. The label on the caption source switch button 751 is changed to now present the CA the option to switch back to the ASR-CA backed up type system if desired. The seconds behind field is still present to give the CA a sense of how well she is keeping up with the HU voice signal.

When a CA changes from the ASR-CA backed up mode to a full CA mode, in some embodiments there will be no change in what the AU sees on her display screen and no way to discern that the change took place, so that there is no issue with visually disrupting the AU during the switchover. In other embodiments there may be some type of clean break so that the AU has a clear understanding that the captioning process has changed. For instance, see FIG. 48 where, after a CA has selected the full CA mode option, a carriage return occurs after the most recently generated ASR text 1500 and a line 1502 is presented to delineate initial ASR and CA generated text. After line 1502, CA generated text is presented to the AU as indicated at 1504. Here, all ASR text previously presented to the AU persists regardless of whether the text is firm, and any initial CA generated text that is inconsistent with ASR generated text is used to correct the ASR generated text via in line correction so that the ASR generated text that is not firm is not completely wiped from the AU's device display screen.

Thus, for instance, in one exemplary system, when a CA takes over initial captioning from an ASR, while ASR generated text that follows the point in an HU voice broadcast most recently listened to or captioned by a CA is removed from the CA's display screen to avoid CA confusion, that same ASR generated text remains on the AU's display screen so that the AU cannot tell from the presented text that the switch over to CA captioning occurred. Then, as the CA re-voices the HU voice signal to generate text or otherwise enters data to generate text for the HU voice signal, any discrepancies between the ASR generated text on the AU display screen and the CA generated text are used to perform in line corrections to the text on the AU display. Thus, to the CA, the initial CA generated text is seen as new text while the AU sees the initial CA text, up to the end of the prior ASR generated text, as in line error corrections.

When a CA initiates a switch from a full CA mode to an ASR-CA backed up mode, the CA display screen shot may switch from a shot akin to the FIG. 47 shot back to the FIG. 23A shot where the button 751 caption is again switched back to “Full CA Captioning/Correction”, the firm text and seconds behind indicator persist at 748A and 755, and ASR generated non-firm text is immediately presented at 769 subsequent to the word 750A currently broadcast (as indicated at 752A) to the CA for consideration and correction.

When a CA initiates a switch from a full CA mode to an ASR-CA backed up mode, again, in some embodiments there may be no change in what the AU sees on her display screen and no way to discern that the switch to the ASR-CA backed up mode took place, so that the AU's visual experience of the captioned text is not disrupted. In other embodiments the AU display screen shot may switch from a shot akin to the FIG. 48 shot to a screen shot akin to the shot shown in FIG. 49 where a carriage return occurs after the most recently generated ASR text 1520 and a line 1522 is presented to delineate initial CA generated and corrected text from following ASR generated and CA corrected text. After line 1522, ASR generated and CA corrected text is presented to the AU as indicated at 1524. Here, all CA generated text previously presented to the AU persists.

While the CA and AU display screen shots upon caption source switching are described above in the context of CA initiated caption source switching, it should be appreciated that similar types of switching notifications may be presented when an AU initiates the switching action. To this end, see, for instance, that in some cases when the system is operating as a full CA captioning system as in FIG. 48, an “ASR-CA Back Up” button 771 is presented that can be selected to switch back to an ASR-CA backed up mode of operation, in which case a screen shot similar to the FIG. 49 shot may be presented to the AU where line 1522 delineates the breakpoint between the CA generated initial text above and the ASR generated initial text that follows.

As another instance, see that in some cases when the system is operating in an ASR-CA backed up mode as in FIG. 49, a “Full CA Captioning/Correction” button 773 is presented that can be selected to switch back to full CA captioning and correction system operation, in which case a screen shot similar to the FIG. 48 shot may be presented to the AU where line 1502 delineates the breakpoint between the ASR generated initial text above and the CA generated initial text that follows.

In at least some embodiments, as the system operates in the ASR-CA backed up mode of operation and text is presented to a CA to consider the text for correction, the CA may be limited to only correcting errors that occur prior to a current point in the HU voice signal broadcast to the CA. Thus, for instance, referring again to FIG. 23A where a currently broadcast HU voice signal word is “restaurant”, CA corrections may be limited to text prior to the word restaurant at 748A so that the CA cannot change any of the words at 769 until after they are broadcast to the CA.

In at least some embodiments when the system is in the ASR-CA backed up mode, a CA mute feature is enabled whenever the CA has not initiated a correction action and automatically disengages when the CA initiates correction. For instance, referring again to FIG. 50, assume a CA is reviewing the ASR generated text to identify text errors as she is listening to the HU voice signal broadcast. Here, if the CA selects the words “Pistol Pals” via touch as indicated at 1560, the selected text is visually distinguished, the HU voice signal broadcast to the CA halts at the word “restaurant”, the CA keyboard becomes active for entering correction text and the muted CA microphone is activated so that the CA has the option to enter corrective text either via the keyboard or via the microphone. In addition, the HU's voice segment including at least the annunciation related to the selected words “Pistol Pals” is immediately rebroadcast to the CA for consideration while viewing the words “Pistol Pals”. Once the CA corrections are completed, the CA microphone is again disabled and the HU voice signal broadcast skips back to the word “restaurant” where the signal broadcast recommences. In some cases selection of the phrase “Pistol Pals” may also open a drop down window with other probable options for that phrase generated by the ASR engine or some other processor function where the CA can quickly select one of those other options if desired.

In some embodiments when a CA starts to correct a word or phrase in an ASR text transcript, once the CA selects the word or phrase for correction, a signal may be sent immediately to an AU device causing the word or phrase to be highlighted or otherwise visually distinguished so that the AU is aware that it is highly likely that the word or phrase is going to be changed shortly. In this way, an AU can recognize that a word or phrase in an ASR text transcription is likely wrong and, if she was relying on the text representation to understand what the HU said, she can simply continue to view the highlighted word or phrase until it is modified by the CA or otherwise cleared as accurate.

Under at least some circumstances an ASR engine may lag an HU voice signal by a relatively long and unacceptable duration. In at least some embodiments it is contemplated that when a relay operates in an ASR-CA backed up mode (e.g., where the ASR generates initial text for correction by a CA), a system processor may track ASR text transcription lag time and, under at least certain circumstances, may automatically switch from the ASR backed up mode to a full CA captioning and correction mode either for the remainder of a call or for at least some portion of the call. For instance, when an ASR lag time exceeds some threshold duration (e.g., 1-15 seconds), the processor may automatically switch to the full CA mode for a predetermined duration (e.g., 15 seconds) so that a CA can work to eliminate or at least substantially reduce the lag time, after which the system may again automatically revert back to the ASR-CA backed up mode. As another instance, once the system switches to the full CA mode, the system may remain in the full CA mode while the ASR continues to generate ASR engine text in parallel and a system processor may continue to track the ASR lag time; when the lag time drops below the threshold value either for a short duration or for some longer threshold duration of time (e.g., 5 consecutive seconds), the system may again revert back to the ASR-CA backed up operating mode. In still other cases where a system processor determines that some other communication characteristic (e.g., line quality, noise level, etc.) or HU voice signal characteristic (e.g., WPM, slurring of words, etc.) is a likely cause of the poor ASR performance, the system may switch to full CA mode and maintain that mode until the perceived communication or voice signal characteristic is no longer detected.
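A minimal sketch of one such lag driven mode controller follows; the specific thresholds, hold time and state names are hypothetical assumptions used only to illustrate the hysteresis described above.

SWITCH_TO_FULL_CA_LAG = 8.0   # seconds of ASR lag that triggers full CA mode
RETURN_TO_ASR_LAG = 3.0       # lag must drop below this value...
RETURN_HOLD_SECONDS = 5.0     # ...and stay there this long before reverting

class ModeController:
    def __init__(self):
        self.mode = "ASR_CA_BACKED_UP"
        self.below_threshold_since = None

    def update(self, asr_lag_seconds, now):
        if self.mode == "ASR_CA_BACKED_UP":
            if asr_lag_seconds >= SWITCH_TO_FULL_CA_LAG:
                self.mode = "FULL_CA"
                self.below_threshold_since = None
        else:  # FULL_CA: the ASR keeps running in parallel and lag is still tracked
            if asr_lag_seconds < RETURN_TO_ASR_LAG:
                if self.below_threshold_since is None:
                    self.below_threshold_since = now
                elif now - self.below_threshold_since >= RETURN_HOLD_SECONDS:
                    self.mode = "ASR_CA_BACKED_UP"
                    self.below_threshold_since = None
            else:
                self.below_threshold_since = None
        return self.mode

controller = ModeController()
print(controller.update(9.0, now=0.0))   # FULL_CA (lag exceeded the switch threshold)
print(controller.update(2.0, now=1.0))   # still FULL_CA; lag must stay low for 5 seconds
print(controller.update(2.0, now=7.0))   # ASR_CA_BACKED_UP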

In at least some cases where a third party provides ASR engine services, ASR delay can be identified whenever an HU voice signal is sent to the engine and no text is received back for at least some inordinate threshold of time.

In at least some cases the ASR text transcript lag time that triggers a switch to a full CA operating mode may be a function of specific skills or capabilities of the specific CA that would take over full captioning and corrections if a switch over occurs. Here, for instance, given a persistent ASR delay of a specific magnitude, a first CA may be able to caption substantially faster while a second could not, so that a switch over to the second CA would only be justifiable if the persistent ASR delay were much longer. Here it is contemplated that CA profiles will include speed and accuracy metrics for associated CAs which can be used by the system to assess when to change over to the full CA system and when not to change over depending on the CA identity and related metrics.

In at least some embodiments it is contemplated that a relay processor may be programmed to coach a CA on various aspects of her relay workstation and on how to handle calls generally and even specific calls while the calls are progressing. For instance, in at least some cases where a CA determines when to switch from an ASR-CA backed operating mode to a full CA mode, a system processor may track one or more metrics during the ASR-CA backed operating mode and compare that metric to metrics for the CA in the CA profile to determine when a full CA mode would be better than the ASR-CA backed mode by at least some threshold value (e.g., 10% faster, 5% more accurate, etc.). Here, instead of automatically switching over to the full CA mode when that mode would likely be more accurate and/or faster by the threshold value, a processor may present a notice or warning to the CA encouraging the CA to make the switch to full CA mode along with statistics indicating the likely increase in captioning effectiveness (e.g., 10% faster, 5% more accurate). To this end, see the exemplary statistics shown at 1541 in FIG. 50 that are associated with a “Full CA Captioning/Correction” button.

In a similar fashion, when a CA operates a relay workstation in a full CA mode, the system may continually track metrics related to the CA's captions and compare those to ASR-CA backed up mode estimates for the specific CA (e.g., based on the CA's profile performance statistics) and may coach the CA on when to switch to the ASR-CA backed operating mode. In this regard, see for instance the speed and accuracy statistics shown at 753 in FIG. 47 that are associated with the ASR-CA Back Up button 751.

In at least some embodiments it is contemplated that a CA will be able to set various station operating parameters to preferred settings that the CA perceives to be optimal for the CA while captioning. For instance, in cases where a workstation operating mode can be switched between ASR-CA backed and full CA, a CA may be able to turn automatic switching on or off so that a switch only occurs when the CA selects an on screen or other interface button to make the switch. As another instance, the CA may be able to specify whether or not metrics (e.g., speed and accuracy as at 753 in FIG. 47) are presented to the CA to encourage a manual mode switch. As another instance, a CA may be able to adjust a maximum cumulative captioning delay period that is enforced during calls. As still one other instance, a CA may be able to turn on and off a 2 times or 3 times broadcast rate feature that kicks in whenever a CA latency value exceeds some threshold duration. Many other station parameters are contemplated that may be set to different operating characteristics by a CA.

In at least some cases it is contemplated that a system processor tracking all or at least a subset of CA statistics for all or at least a subset of CAs may routinely compare CA statistical results to identify high and low performers and may then analyze CA workstation settings to identify any common setting combinations that are persistently associated with either high or low performers. Once persistent high performer settings are identified, in at least some cases a system processor may use those settings to coach other CAs and, more specifically, low performing CAs on best practices. In other cases, persistent high performer settings may be presented to a system administrator to show a correlation between those settings and performance, and the administrator may then use those settings to develop best practice materials for training other CAs.

For example, assume that several CAs set workstation parameters such that a system processor only broadcasts HU voice signal corresponding to phrases that have confidence factors of 6/10 or less at the HU's speaking rate and speeds up broadcast of any HU voice signal corresponding to phrases that have 7/10 or greater confidence factors to 2× the HU's speaking rate. Also assume that these settings result in substantially faster CA error correction than other station settings. In this case, a notice may be automatically generated to lower performing CAs encouraging each to experiment with the expedited broadcast settings based on ASR text confidence factors.
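A minimal sketch of this station setting follows, using the 6/10 cutoff and 2x rate from the example above; the function and phrase data are hypothetical.

def playback_rate(confidence_out_of_10):
    # Low confidence phrases are broadcast at the HU's speaking rate; high
    # confidence phrases are expedited to 2x.
    return 1.0 if confidence_out_of_10 <= 6 else 2.0

phrases = [("I'll meet you at", 9), ("Pistol Pals", 4), ("at seven", 8)]
print([(text, playback_rate(conf)) for text, conf in phrases])
# [("I'll meet you at", 2.0), ('Pistol Pals', 1.0), ('at seven', 2.0)]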

Various system gaming aspects have been described above where CA statistics are presented to a CA to help her improve skills and captioning services in a fun way. In some cases it is contemplated that a system processor may routinely compare a specific CA with her own average and best statistics and present that information to the CA either routinely during calls or at the end of each call so that the CA can compete against her own prior statistics. In some cases two or more CAs may be pitted against each other, somewhat like a race, to see who can caption the fastest, correct more errors in a short period of time, generate the most accurate overall caption text, etc. In some cases CAs may be able to challenge each other and may be presented real time captioning statistics during a challenge session where each gets to compare her statistics to the other CA's real time statistics. To this end, see the exemplary dual CA statistics shown at 771 in FIG. 47 where the statistics shown include average captioning delay, accuracy level and number of errors corrected for a CA using a station that includes the display screen 50 and for another CA, Bill Blue, captioning and correcting at a different station. Leaders in each statistical category are visually distinguished. For instance, statistic values that are best in each category are shown double cross hatched in FIG. 47 to indicate green highlighting.

While CA call and performance metrics may be textually represented in some cases, in other cases particularly advantageous metric indicators may have at least some graphic characteristics so that metrics can be understood based on a simple glance. For instance, see the graphical performance representation at 787 in FIG. 47 where arrows 789 that represent instantaneous statistics dynamically float along horizontal accuracy and speed scales to indicate performance characteristics. In some cases the graphical characteristics may be calculated relative to personal averages from a specific CA's profile and in other cases the characteristics may be calculated relative to all or a subset of CAs associated with the system.

In some embodiments it is contemplated that CAs may be automatically rewarded for good performance or increases in performance over time. For instance, for every 2 hours a CA performs at or above some threshold performance level, she may be rewarded with a coupon for coffee or some other type of refreshment. As another instance, when a CA's persistent error correction performance level increases by 5% over time, she may be granted a paid hour off at the end of the week. As yet one other instance, where CAs compete head to head in a captioning and correcting contest, the winner of a contest may be granted some reward to incent performance increases over time.

In line error corrections are described above where initial ASR or CA generated text is presented to an AU immediately upon being generated and then, when a CA or an ASR corrects an error in the initial text, the erroneous text is replaced “in line” in the text already presented to the AU. In at least some cases the corrected text is highlighted or otherwise visually distinguished so that an AU can clearly see when text has been corrected. Major and minor errors are also described where a minor error is one that, while wrong, does not change the meaning of an including phrase while a major error does change the meaning of an including phrase.

It has been recognized that when text on an AU display screen is changed and visually distinguished often, the cumulative highlighted changes can be distracting. For this reason, in at least some embodiments it is contemplated that a system processor may filter CA error corrections and may only change major errors on an AU display screen so that minor errors that have no effect on the meaning of including phrases are simply not shown to the AU. In many cases limiting AU text error correction to major error corrections can decrease in line on screen corrections by 70% or more, substantially reducing the level of distraction associated with the correction process.

To implement a system where only major errors are corrected on the AU display screen, all CA error corrections may be considered in context by a system processor (e.g., within including phrases) and the processor can determine if the correction changes the meaning of the including phrase. Where the correction affects the meaning of the including phrase, the correction is sent to the AU device along with instructions to implement an in line correction. Where the correction does not affect the meaning of the including phrase, the error may simply be disregarded in some embodiments and therefore never sent to the AU device. In other cases where a correction does not affect the meaning of the including phrase, the error may still be transmitted to the AU device and used to correct the error in a call text archive maintained by the AU device as opposed to in the on screen text. In this way, if the AU goes back in a call transcript to review content, all errors, including major and minor, are corrected.
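A minimal sketch of the routing decision follows. The hard part, deciding whether a correction changes the meaning of the including phrase, is represented here by a caller supplied test function; the trivial stand-in test used in the example is purely an assumption for illustration.

def route_correction(original_phrase, corrected_phrase, changes_meaning):
    # changes_meaning is whatever contextual test the processor applies.
    if changes_meaning(original_phrase, corrected_phrase):
        return "send_to_au_screen"   # major error: in line correction, visually distinguished
    return "archive_only"            # minor error: correct only the AU device call archive

# Stand-in meaning test that treats an a/the swap as meaning preserving (assumption).
naive_test = lambda a, b: a.lower().replace(" a ", " the ") != b.lower().replace(" a ", " the ")
print(route_correction("meet at the bank", "meet at a bank", naive_test))    # archive_only
print(route_correction("meet at the bank", "meet at the park", naive_test))  # send_to_au_screen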

In other embodiments, instead of only correcting major errors on an AU device display screen, all errors may be corrected but the system may only highlight or otherwise visually distinguish major errors to reduce error correction distraction. Here, the thinking is that if an AU cares at all about error corrections, the most important corrections are the ones that change the meaning of an including phrase and therefore those changes should be visually highlighted in some fashion.

In a similar fashion, automated ASR error corrections may be transmitted to a CA workstation where major and minor errors are treated differently. As in the case of how errors may be used by an AU captioned device, a CA workstation may only make major error changes on the CA display, may make all error changes and only highlight or otherwise visually distinguish major errors from other captioned text, may make major error changes in real time as they are received at the relay and minor error changes in archived text, etc.

CA Sensors

(i) Eye Sight Trajectory Sensor(s)

CA station sensor devices can be provided at CA workstations to further enhance a CA's captioning and error correction capabilities. To this end, in at least some embodiments some type of eye trajectory sensor may be provided at a CA workstation for tracking the location on a CA display screen that a CA is looking at so that a word or phrase on the screen at the location instantaneously viewed by the CA can be associated with the CA's sight. To this end, see, for instance, the CA workstation 1700 shown in FIG. 54 that includes a display screen 50, keyboard 52 and headphones 54 as described above with respect to FIG. 1. In addition, the station 1700 includes an eye tracking sensor system, represented by numeral 1702, that is directed at a CA's location at the station and specifically positioned to capture images or video of the CA using the station. The camera field of view (FOV) is indicated at 1712 and is specifically trained on the face of a CA 1710 that currently occupies the station 1700.

Referring still to FIG. 54 and also to FIG. 55, images from sensor 1702 can be used to identify the CA's eyes and, more specifically, the trajectory of the CA's line of sight as labelled 1714. As best shown in FIG. 55, the CA's line of sight intersects the display screen 50 at a specific location where the text word “restaurant” is presented. In some embodiments, as illustrated, the word a CA is currently looking at on the screen 50 will be visually highlighted or otherwise distinguished as feedback to the CA indicating where the system senses that the CA is looking. Known eye tracking systems have been developed that generate invisible bursts of infrared light that reflects differently off a station user's eyes depending on where the user is looking. A camera picks up images of the reflected light which is then used to determine the CA's line of sight trajectory. In other cases a CA may wear a headset that tracks headset orientation in the ambient as well as the CA's pupil to determine the CA's line of sight. Other eye tracking systems are known in the art and any may be used in various embodiments.

Here, instead of having to move a mouse cursor to a word on the display screen or having to touch the word on the screen to select it, a CA may simply tap a selection button on her keyboard 52 once to select the highlighted word (e.g., the word subtended by the CA's line of sight) for error correction. In some cases a double tap of the keyboard selection button may cause the entire phrase or several words before and after the highlighted word to be selected for error correction.
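A minimal sketch of the word selection step follows, assuming the eye tracker reports a gaze point in screen coordinates and that the workstation knows a bounding box for each displayed word; all names and coordinates are hypothetical.

def word_under_gaze(gaze_x, gaze_y, word_boxes):
    # word_boxes: list of (word_index, x0, y0, x1, y1) screen rectangles.
    for index, x0, y0, x1, y1 in word_boxes:
        if x0 <= gaze_x <= x1 and y0 <= gaze_y <= y1:
            return index
    return None

def selection_for_tap(word_index, words, double_tap=False, span=2):
    # A single tap selects the gazed word; a double tap selects surrounding words.
    if word_index is None:
        return []
    if double_tap:
        return words[max(0, word_index - span):min(len(words), word_index + span + 1)]
    return [words[word_index]]

boxes = [(0, 0, 0, 80, 20), (1, 85, 0, 160, 20), (2, 165, 0, 260, 20)]
words = ["Pistol", "Pals", "restaurant"]
index = word_under_gaze(100, 10, boxes)                   # gaze is over the second word
print(selection_for_tap(index, words))                    # ['Pals']
print(selection_for_tap(index, words, double_tap=True))   # ['Pistol', 'Pals', 'restaurant']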

Once a word or phrase is selected for error correction, the current HU voice signal broadcast 1720A may be halted, the word or phrase selected may be differently highlighted or visually distinguished and then re-broadcast for CA consideration as the CA uses the keyboard or microphone to edit the highlighted word or phrase. Once the word or phrase is corrected, the CA can tap an enter key or other keyboard button to enter the correction and cause the corrected text to be transmitted to the AU device for in line correction. Once the enter key is selected, HU voice signal broadcast would recommence at the word 1720 where it left off.

In some embodiments the eye tracking feature may be used to monitor CA activity and, specifically, whether or not the CA is considering all text generated by an ASR or CA re-voicing software. Here, other metrics may include the percent of text words viewed by a CA for error correction, durations of time required to make error corrections, etc.

(ii) CA Fatigue Sensor(s)

In at least some cases a CA workstation may be equipped with one or more sensor devices that generate data useable by a system processor to assess CA fatigue. For instance, a camera or a touch sensor built into a CA's wrist rest, keyboard or other input device may be able to generate data usable to assess blood pressure, heart rate, perspiration rate, or any other biometric parameter suitable to assess CA stress level or fatigue. Here, the system may automatically adjust a CA's captioning schedule in any of several different ways. As one simple example, when a CA's fatigue level exceeds some threshold level consistent with low productivity (e.g., a level that is consistent with a drop in captioning productivity or accuracy or some combination of those), the system may simply schedule a 10 minute break to give the CA time to rejuvenate.

As another example, when a CA's fatigue level exceeds the threshold, the system may steer calls that are perceived to be relatively easy to caption to the CA for at least some duration so that, despite lower productivity, the CA may still be able to meet AU and system expectations related to speed and accuracy. Here, for instance, where first and second CAs are handling first and second calls that are assessed by a system processor to be relatively easy and relatively hard to caption (e.g., easy meaning the HU speaking rate is relatively slow, the HU voice signal is easy to understand, etc., and hard meaning the HU speaking rate is fast and/or the voice signal is hard to understand), respectively, and where the system ascertains that the second CA is exhausted, the system may automatically swap the remainder of the calls between the first and second CAs so that the second CA handles the first, relatively easier call and can more easily meet speed and accuracy expectations.

Multiple ASR Systems

In at least some embodiments it is contemplated that two or more ASR engines of different types (e.g., developed and operated by different entities) may be available for HU voice signal captioning. In these cases, it is contemplated that one of the ASR engines may generate substantially better captioning results than other engines. In some cases it is contemplated that at the beginning of an AU-HU call, the HU voice may be presented to two or more ASR engines so that two or more HU voice signal text transcripts are generated. Here, a CA may correct one of the ASR text transcripts to generate a “truth” transcript presented to an AU. Here, the truth transcript may be automatically compared by a processor to each of the ASR text transcripts associated with the call to rank the ASR engines best to worst for transcribing the specific call. Then, the system may automatically start using the best ASR engine for transcription during the call and may scrap use of the other engines for the remainder of the call. In other cases, while the other engines may be disabled, they may be re-enabled if captioning metrics deteriorate below some threshold level and the process above of assigning metrics to each engine as text transcripts are generated may be repeated to identify a current best ASR engine to continue servicing the call.

In another multi-ASR system, a plurality of ASR engines may persistently operate to generate multiple ASR caption streams throughout a call and a processor may automatically switch the stream transmitted to an AU and presented for correction to a CA based on relative accuracies of the separate streams. Thus, for instance, where five different ASR engines applying different voice to text algorithms generate five different ASR caption streams, a processor may compare each ASR caption stream to a CA's corrected captions over a rolling comparison period (e.g., 10 seconds to one minute) to assess the recently most accurate ASR engine and may then switch the ASR stream presented to the CA and AU so that ASR caption accuracy is maximized. This type of system will be particularly useful in cases where HU voice signal quality or line noise changes or a speaker at the HU end of a call changes (e.g., a child takes over a call from a father) during the course of a call, so that one ASR engine may be most accurate at one time while another is most accurate at a different time.

Referring now to FIG. 59, an exemplary process 1900 for switching between a plurality of ASR engines based on engine accuracy is illustrated as a flow chart. As shown, process 1900 utilizes a plurality of ASR engine sub-processes 1906, 1908, 1910, 1912 and 1914 (ASRN where N may be any integer value) where the sub-processes are similar, albeit applying different voice to text algorithms and therefore generating at least some different caption results at times. Exemplary ASR engine process 1906 includes a first process block 1920 where the ASR1 engine receives an HU voice signal and generates ASR1 captions as well as a block 1922 where the ASR1 captions are compared to CA corrected captions to generate an ASR1 accuracy metric. While not separately labelled, each of the other ASR engine processes includes two process blocks similar to blocks 1920 and 1922, albeit where the different ASR algorithms (e.g., ASR2, ASR3, etc.) are used to generate the ASR caption streams for comparison to the CA corrected captions.

At the beginning of a call captioning session (e.g., after an AU has requested caption service for an ongoing call), in at least some embodiments there will be no way to ascertain an optimal ASR for a current call and therefore the system is programmed to use a default ASR engine at least initially until accuracy metrics for the plurality of ASR engines are generated to fuel selection of an instantaneously optimal ASR engine. In the FIG. 59 example, at initial process block 1902 an initial ASR countdown timer is initiated during which the default ASR engine is used to generate initial captions for an AU and a CA. Here, for instance, the initial countdown timer may be set for a short duration (e.g., 10 seconds to 90 seconds). In other cases, the initial ASR engine may be employed for a predefined number of captioned words (e.g., 100) or until some other occurrence such as, for instance, a discernible and appreciable difference in ASR engine accuracy. Unless indicated otherwise, the condition (e.g., timed out timer, number of captioned words, etc.) that must occur prior to switching to the most accurate ASR engine will be referred to generally as the “initial switching condition”.

Referring still to FIG. 59, at block 1904 a current engine ASR_(current) is set equal to the default ASR (e.g., ASR1 in the illustrated example) that is then used, prior to occurrence of the initial switching condition, to generate initial captions to send to the AU for display and to the CA for error correction.

The HU voice signal received at the relay is provided in parallel to each of the first ASR1 1906 through Nth ASRN 1914 automated captioning engines. The initial default engine ASR1 automatically generates first ASR1 captions at block 1920 which are the ASR_(current) captions prior to occurrence of the initial switching condition. In FIG. 59, the initial switching condition (e.g., the initial ASR countdown timer condition) is monitored at decision block 1926 and, until that condition is met, control passes down to block 1930. At block 1930, a relay processor transmits the ASR_(current) captions to the AU captioned device for display.

At block 1932, the processor presents the ASR_(current) captions on a CA workstation display screen and broadcasts the HU voice signal to the CA to hear. At block 1934 a CA corrects any perceived errors in the ASR_(current) captions and at 1936 the corrections are transmitted to the AU captioned device which is programmed to make in line corrections or other corrections consistent with the CA corrections.

Referring still to FIG. 59, at block 1918 the CA corrected captions are provided to the second block in each of the ASR engine processes (e.g., ASR1 through ASRN). Block 1922 in the first process 1906 compares the ASR1 captions to the CA corrected captions to generate an ASR1 accuracy metric for a preceding duration X. Here, the duration X may be equal to the countdown timer duration, or less than or greater than that duration. In at least some cases the preceding duration X will be within a range of 10 seconds to 90 seconds although other ranges are contemplated. Similarly, each of the other ASR processes 1908 through 1914 compares associated ASR captions to the CA corrected captions thereby generating a separate accuracy metric ASR2 through ASRN for each of those processes.

All of the ASR accuracy metrics are provided to decision block 1926 and, eventually, once the initial switching condition is met (e.g., the countdown timer expires), are provided to block 1927. At block 1927, once the initial switching condition occurs, a processor compares the ASR accuracy metrics for each engine process 1906 through 1914 to identify the most accurate engine over the most recent duration X (e.g., X rolls over time during an ongoing call). At block 1928, the current engine ASR_(current) is set to the most accurate ASR after which control passes to block 1930 where the process described above continues.

During a long HU-AU captioning session, it is possible that the most accurate ASR engine will change several times during the session as line and signal quality changes or as the person on the HU end of the call changes. To avoid rapid or essentially meaningless ASR engine changes, in at least some embodiments a threshold for accuracy increase may be set so that the system only switches from a current ASR to a more accurate ASR if the more accurate ASR is more accurate by a threshold percent (e.g., 10% more accurate). Similarly, the system may impose a limit on the rate of ASR changes so that, for instance, no more than one ASR_(current) change occurs every 20 seconds.
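A minimal sketch of these two switching guards follows; the 10% margin and 20 second minimum interval mirror the examples above, while the function name and data structures are hypothetical.

MIN_IMPROVEMENT = 0.10        # candidate must be 10% more accurate than ASR_(current)
MIN_SWITCH_INTERVAL = 20.0    # minimum seconds between ASR_(current) changes

def maybe_switch(current_engine, accuracies, last_switch_time, now):
    # accuracies: dict mapping engine name to accuracy over the rolling window X.
    best = max(accuracies, key=accuracies.get)
    if best == current_engine:
        return current_engine, last_switch_time
    if now - last_switch_time < MIN_SWITCH_INTERVAL:
        return current_engine, last_switch_time
    if accuracies[best] < accuracies[current_engine] * (1.0 + MIN_IMPROVEMENT):
        return current_engine, last_switch_time
    return best, now

# ASR3 is more than 10% better than ASR1 and 25 seconds have elapsed, so switch.
engine, switch_time = maybe_switch("ASR1", {"ASR1": 0.80, "ASR2": 0.82, "ASR3": 0.90}, 100.0, 125.0)
print(engine)   # ASR3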

The accuracy metric may take many different forms. For instance, in some cases the accuracy metric may simply comprise a count of errors that occurred over the prior X duration. As another instance, the accuracy metric may be based on a count of errors that change the meaning of captions over the prior X duration. Many other accuracy metrics are contemplated.

In some cases, the initial default ASR engine in FIG. 59 may change as a function of ASR accuracy metrics that occurred during prior calls. For instance, if one of the ASR engines routinely has the highest accuracy metric, the system may automatically initiate all caption sessions with that ASR as the default. If the ASR with the routinely highest accuracy metric changes over time, a different ASR with the best accuracy metric may be set as the default.

In some embodiments it is contemplated that where a captioning session is commenced during an ongoing HU-AU call, HU voice signal prior to commencement of the captioning session may be automatically captured and used to assess a most accurate ASR engine to be used as the ASR_(current) engine once a captioning session starts. Here, because a CA only corrects ASR captions after a caption session is initiated, there would be no CA corrected captions to operate as “truth” for assessing ASR accuracy metrics as in FIG. 59. For this reason, a system processor would have to generate confidence factors related to the ASR captions which would then be used to assess ASR accuracy metrics used to select a most accurate ASR as the initial default when a captioning session commences.

In still other cases it is contemplated that a system processor may be programmed to use some set of call characteristics to select a current ASR_(current) instead of relying on accuracy disparities. For instance, as a simple example, it may be that a third ASR3 engine out of the N engines routinely generates higher caption accuracy when an HU-AU phone link has a noise level above some threshold level. In this case, a processor may monitor line noise and select the ASR_(current) based thereon. Other call characteristics and combinations of characteristics to trigger specific ASR_(current) engines are contemplated including HU voice signal volume, vowel shapes, dynamic pitch range, etc. ASR_(current) selection may be based on pre-caption session call characteristics, characteristics during an ongoing call, or a combination of both.

In at least some cases a system processor will be programmed to dynamically learn which ASR engine is most accurate or meets other desirable characteristics for calls with specific call characteristics and may then use the most advantageous ASR engine to generate captions based on perceived call characteristic sets.

While FIG. 59 is described in the context of ASR accuracy metrics, other metrics may also be considered when the system selects an ASR_(current). For instance, a speed metric related to how quickly specific ASR engines generate text or firm up automated ASR error corrections may be generated for each ASR engine and used as a factor in selecting the ASR_(current) engine. Here, the speed or correction firming metric may be dispositive or may be combined with other factors (e.g., accuracy) to select the ASR_(current) engine.

In some cases a switch from one ASR_(current) engine to a next may be delayed until some additional event occurs. For instance, the switch to a next ASR_(current) engine may only occur upon occurrence of a silence period in the HU voice signal or upon an utterance by an AU.

In some cases each ASR may generate confidence factors for each word, phrase, utterance or call time slice (e.g., 5 second durations) and a system processor may, for each word, phrase, utterance or call time slice, use the captions that have the highest confidence factor as the ASR_(current) captions regardless of which ASR engine generated the captions. Thus, at the limit, in a ten word HU utterance, a different one of the ASR engines may generate each consecutive captioned word in the final ASR_(current) text presented to an AU and a CA.
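A minimal sketch of such per word selection follows, under the simplifying assumption that the engine outputs are already aligned word for word; the alignment itself is a separate problem and the names used are hypothetical.

def merge_by_confidence(engine_outputs):
    # engine_outputs: dict of engine name -> list of (word, confidence) pairs,
    # all lists assumed to be the same length and aligned by word position.
    length = len(next(iter(engine_outputs.values())))
    merged = []
    for i in range(length):
        word, _ = max((output[i] for output in engine_outputs.values()),
                      key=lambda pair: pair[1])
        merged.append(word)
    return merged

outputs = {
    "ASR1": [("meet", 0.90), ("at", 0.95), ("tree", 0.40)],
    "ASR2": [("meat", 0.60), ("at", 0.90), ("three", 0.85)],
}
print(merge_by_confidence(outputs))   # ['meet', 'at', 'three']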

Some systems described above include an ASR that is integrated into an AU captioned device or some other device that is directly accessible to the captioned device, instead of or in addition to an ASR located at a remote captioning relay. AU captioning devices with integrated or accessible ASRs have several advantages in addition to those described above. First, see again FIG. 17 where turn piping graphics are presented at 216 indicating uttered HU voice signal words that have yet to be captioned by a CA or, in some cases, by either an ASR or a CA. Where an AU captioned device includes an integrated or directly accessible ASR, turn piping can be done extremely fast to present a better real time sense of delay in CA captioning. In this case, even a relatively inaccurate and likely inexpensive ASR may be integrated into the AU captioned device.

Second, an integrated or directly accessed ASR could result in substantial savings over a remote cloud based service if the integrated ASR works well. In this case, the ASR would present automated ASR captions to the AU via the captioned device immediately upon generation and would use differences between CA corrected captions and the ASR captions to correct the initial text. In some cases, in addition to presenting the ASR captions to the AU, those captions may be transmitted to a relay to be presented for error correction to a CA. In other cases a CA may simply generate CA captions and make error corrections and each of those may be sent to the AU captioned device to make different rounds of error corrections to the text presented to the AU.

Third, in systems that run two or more ASRs in parallel, a second or additional ASR may be provided relatively inexpensively as an integration in the AU captioned device. Thus, a first ASR engine may operate at a remote relay or caption service provider while a second ASR may be integrated in the captioned device. Here, advantages include a lower cost second ASR, an ability to automatically generate at least some metrics related to how well a first cloud based captioning engine is operating, an ability to assess different ASR caption accuracies, etc.

Fourth, in cases where a first ASR engine is operated by a third party that a relay links to for captioning, an integrated second ASR would enable continued captioning service if the link to the third party captioning provider fails. Here, for instance, if the third party captioning provider fails to generate captions, a relay may be programmed to obtain captions from the integrated ASR engine for error correction so captioning can continue substantially uninterrupted.

Fifth, an integrated ASR can be employed in ways that limit privacy concerns. To this end, in at least some embodiments it is contemplated that a local integrated ASR may be used to generate ASR text and only portions of an HU voice signal may be transmitted to a remote relay for captioning so that an attending CA only hears portions of an HU voice signal. For instance, in some cases, an HU voice signal may be divided into sequential 7 second time slices where an integrated ASR (e.g., an ASR that is integrated into an AU captioned device) generates a complete ASR caption stream for the HU voice signal and where only every other 7 second time slice of HU voice signal is transmitted to the relay for CA captioning. In other cases, the local ASR may be used to dynamically identify pauses or other good times at which to start and stop HU voice signal slices that are transmitted to a relay for captioning.
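A minimal sketch of the alternating slice rule follows, using the 7 second slice length from the example; the function name and slice numbering convention are hypothetical.

SLICE_SECONDS = 7.0

def should_forward_to_relay(elapsed_seconds):
    # Forward the even numbered slices (0-7s, 14-21s, ...) to the relay for CA
    # captioning and keep the odd numbered slices local to the integrated ASR.
    slice_index = int(elapsed_seconds // SLICE_SECONDS)
    return slice_index % 2 == 0

print([should_forward_to_relay(t) for t in (3, 10, 16, 25)])
# [True, False, True, False]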

In at least some cases, a captioned device processor may assign confidence factors to ASR caption words or phrases and may only transmit low confidence factor HU voice signal to a relay for more accurate captioning service. In still other cases, a captioned device processor may examine ASR caption text for specific words or phrases that are often sensitive from a privacy perspective and may, in effect, redact those words or phrases, or phrases that include those words or phrases, from the HU voice signal that is transmitted to a relay for captioning. Here, the ASR captions for the redacted words would persist in the captions presented to the AU. For example, numbers, names of diseases, etc., may be blocked from the audio transmitted to the relay.
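A minimal sketch of such a disposition decision follows; the sensitive term list, digit rule and confidence threshold are hypothetical assumptions standing in for whatever privacy rules a captioned device might apply.

SENSITIVE_TERMS = {"diagnosis", "account", "social security"}
CONFIDENCE_THRESHOLD = 0.75

def audio_disposition(phrase_text, confidence):
    lowered = phrase_text.lower()
    if any(term in lowered for term in SENSITIVE_TERMS) or any(ch.isdigit() for ch in lowered):
        return "withhold_audio"   # keep the local ASR caption, redact the audio sent to the relay
    if confidence < CONFIDENCE_THRESHOLD:
        return "send_to_relay"    # low confidence phrase: CA correction requested
    return "keep_local"           # high confidence phrase: no CA review needed

print(audio_disposition("my account number is 4412", 0.95))   # withhold_audio
print(audio_disposition("Pistol Pals restaurant", 0.55))      # send_to_relay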

Privacy

Several aspects of and features of various captioning systems related to privacy are described above. Here we gather various privacy related concepts in one place and embellish several of those concepts with additional important details.

One problem with existing HU captioning systems where a CA listens to a phone call between an HU and an AU is that the CA may hear an entire conversation between the conversing parties. While CAs agree to complete confidentiality, the privacy guarantee is only as good as a CA's word and, in any event, AUs and HUs that are aware that a CA hears an entire conversation are often, rightly or wrongly, uncomfortable with a third party listening in on their conversation. For this reason, many of the captioning systems disclosed herein operate in ways designed to assuage privacy concerns of both the AU and HU participating in a call.

One solution that is implemented in at least some embodiments of the present disclosure is to only transmit an HU voice signal to a relay for captioning. In most systems, at least one party on a call, the HU, has no or minimal hearing loss and therefore there is no reason to caption an AU's voice and therefore no reason to transmit the AU voice signal to the relay for captioning. In these cases a CA only hears one side of a conversation (e.g., the HU voice side), which tends to obfuscate the meaning of communications during a call.

While only passing an HU's side of a conversation on to a relay for CA captioning affords some privacy related advantages, in many cases CAs can ascertain a lot about what is being communicated from a single side of a conversation. For this reason other solutions for increasing the degree of private communications in a captioning system are desired. A second solution, as described above, is to provide a full ASR system that captions HU voice signals to be presented to an AU via a display screen. This solution is clearly private, as no CA or other administrator listens to the HU voice signal and instead a processor runs software to automatically generate HU voice signal captions. However, this solution alone has not worked well as ASR captions are often insufficiently accurate for the purposes of providing meaningful captions for real time communications.

A third solution is to use ASR captions some of the time and have a CA at a relay listen to only time slices of an HU voice signal and transcribe those time slice signals. For instance, an ASR may generate ASR captions for an entire HU voice signal that are presented to an AU immediately upon caption generation while every other 10 second period of the HU voice signal is sent to a CA for captioning and correction. Then, the CA captions and error corrections may be sent back to the AU device for correcting corresponding portions (e.g., time slices) of the ASR captions.

In other cases, an ASR may generate ASR text, assign confidence factors to each word or phrase in the text and then only transmit low confidence ASR text and corresponding HU voice signal to the relay for consideration by a CA. This type of system is especially advantageous in cases where a CA's captioning lags behind an ASR engine, which affords the engine an opportunity to caption HU voice and identify a confidence factor for each word or phrase prior to a CA considering the HU voice signal for error correction. In fact, in at least some cases it is contemplated that an HU voice signal may be delayed for a short duration period selected so that an ASR has time to generate ASR captions and confidence factors prior to transmitting (or not) associated HU voice signal to the AU communication device so that only HU voice signal associated with low confidence ASR captions, and the associated ASR captions, are sent to a CA for error correction. In these cases where a CA only has the chance to perceive part of one side of a conversation, privacy is increased appreciably.

A fourth solution is to use more than one CA to caption an HU voice signal or correct ASR or CA generated captions during a call. For instance, in a simple case where a relay call center has first and second CAs working at one time to caption HU voice signals, the first and second CAs may provide caption services (e.g., captioning, correction or both captioning and correction) for consecutive 20 second interleaved slices of an HU voice signal for a single call so that each CA only perceives about half of one side of the call. Thus, for instance, a first CA may listen to a first 20 second duration of an HU voice signal during a captioning session and generate captions, then a second CA may handle the next 20 seconds of HU voice signal while the first CA is disconnected from the call, then the first CA may be reconnected to the call to handle the next 20 seconds while the second CA is disconnected, and so on, typically with some overlap between CA segments.

In a case where 200 CAs work at a relay center at a time to caption HU voice signals, a CA may only be exposed to a small time slice of any one call. For instance, during a ten minute captioning session where the HU voice signal is divided into 20 second segments, 30 different CAs may each handle 20 second HU voice signal slices to provide complete captioning service during the entire 10 minute call. One advantage here is that CA downtime when not handling a call can be minimized as any available CA can be assigned to any HU voice signal slice of any of several different simultaneous calls. Thus, for instance, when a first CA completes a 20 second captioning time slice for a first call, that first CA may only have 3 seconds prior to being assigned a 20 second time slice in a second ongoing call, and so on.
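A minimal sketch of this kind of slice scheduling follows; it makes the simplifying assumption that a CA is back in the idle pool before the next slice arrives, and the data structures and names are hypothetical.

from collections import deque

def assign_slices(call_slices, ca_pool):
    # call_slices: list of (call_id, slice_index) in arrival order.
    # ca_pool: iterable of CA identifiers. Returns one assignment per slice.
    free_cas = deque(ca_pool)
    assignments = []
    for call_id, slice_index in call_slices:
        ca = free_cas.popleft()      # take the longest idle CA
        assignments.append((call_id, slice_index, ca))
        free_cas.append(ca)          # the CA returns to the pool after the slice
    return assignments

slices = [("call_A", 0), ("call_B", 0), ("call_A", 1), ("call_B", 1), ("call_A", 2)]
print(assign_slices(slices, ["CA1", "CA2", "CA3"]))
# [('call_A', 0, 'CA1'), ('call_B', 0, 'CA2'), ('call_A', 1, 'CA3'),
#  ('call_B', 1, 'CA1'), ('call_A', 2, 'CA2')]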

In at least some cases where the system automatically swaps in one CA for another in a time sliced manner, the duration of HU voice signal time slices may be dynamic and based on HU signal characteristics. For instance, time slices may have a range of duration between 15 seconds and 30 seconds and a system processor may select a time slice duration that makes sense given silent periods in an HU voice signal or other call factors. For example, if an HU voice signal is silent for 2 seconds at the 17 second point of a slice, the processor may cut out a current CA and switch to a second CA.

In particularly advantageous systems, automatically switching from one CA to another during a single call for privacy reasons will have additional advantages. For instance, in at least some cases a processor may be programmed to favor switching CAs when current CA captions or error corrections lag behind ASR text. For instance, if current CA captions lag 16 seconds behind ASR captions, a system processor may split the difference and assign a second CA to take over error corrections starting 8 seconds back from the current ASR caption time. Thus, here, the first CA would complete the first 8 seconds of captioning of the 16 second delay and the second CA would pick up from there, both operating in parallel to eliminate the 16 second delay in a relatively short time. Here, in addition to facilitating greater privacy by having two CAs caption different sections of an HU voice signal, by switching CAs during captioning-error correction delays, overall captioning speed can be increased substantially. In some cases CA switches may only occur when a current CA captioning or error correction effort falls behind by some threshold duration (e.g., 30 seconds).
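A minimal sketch of the split described in this example follows; the threshold value and return format are hypothetical.

SPLIT_THRESHOLD = 10.0   # seconds of lag before a second CA is brought in

def split_backlog(lag_seconds):
    if lag_seconds <= SPLIT_THRESHOLD:
        return None
    handoff = lag_seconds / 2.0
    return {"first_ca_covers": (0.0, handoff),           # oldest half of the backlog
            "second_ca_covers": (handoff, lag_seconds)}  # newer half, up to real time

print(split_backlog(16.0))
# {'first_ca_covers': (0.0, 8.0), 'second_ca_covers': (8.0, 16.0)}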

A fifth solution, which also affords a captioning speed advantage in addition to enhanced privacy, is to have CAs that are waiting in a queue to handle incoming caption sessions handle at least portions of ongoing captioning sessions where current CA captions or error corrections are substantially lagging. Thus, for instance, assume a first CA is 30 seconds behind on captioning an HU voice signal in a first ongoing call. Here, a second CA waiting in a queue to handle an incoming session may be temporarily assigned to the first call to handle the delay and the first CA may be skipped ahead automatically to handle real time HU voice signal. Here, at the end of the HU voice signal corresponding to the 30 second delay, the second CA would be disconnected from the first call and placed back in the queue to await a new captioning session or to be assigned to again handle captioning during another prolonged delay in an ongoing call.

In still other similar systems, one or more CAs at a relay station may simply be assigned as catch up CAs where they never handle a complete call and instead are only assigned to calls for short durations (e.g., less than 60 seconds) to help other CAs catch up to real time HU voice signals when the other CAs fall behind on captioning. Thus, for instance, a first "catch up" CA may be assigned for 25 seconds during a first call, off for 4 seconds, assigned to caption on a second call for 32 seconds and then off for 3 seconds, then assigned to a third call for 18 seconds, and so on.

In the above cases, second or catch up CAs only hear and perceive short portions of the HU voice signals and a first or main CA on a call, while hearing most of the HU voice signal, hears less than all of that signal and therefore privacy is better than in some of the other systems contemplated above.

In some cases combinations of the above privacy enhancing solutions are implemented. For instance, in one exemplary system an HU voice signal on a first call may be handled as follows. First, an ASR may receive an HU voice signal and generate ASR captions for that signal as well as confidence factors for each word in the HU voice signal. The ASR text may initially be provided to an AU via a display screen. A system processor may identify only phrases including low confidence factor words and may only transmit low confidence text and associated HU voice signal to a relay for CA captioning. At the relay, the first 20 seconds of the HU voice signal corresponding to low confidence ASR captions may be presented to a first CA for captioning, the second 20 seconds of low confidence ASR captions to a second CA for captioning, the third 20 seconds of low confidence ASR captions to a third CA for captioning, and so on. For instance, in a first minute of ASR captions, it may be that only 20 seconds of ASR captions are low confidence, and that 20 seconds would be error corrected by the first CA. Similarly, in each of second and third minutes of ASR captions, it may be that there are also 20 seconds of ASR captions that have low confidence factors. In this example the second CA would error correct the second 20 seconds of low confidence factor ASR captions and the third CA would error correct the third 20 seconds of low confidence factor ASR captions, and so on. Thus, in the first 3 minutes of HU voice signal, each of the CAs would only error correct 20 seconds of the ASR captions and substantial privacy would persist.

Cloud and Relay ASR Systems

Generally there are two different types of ASRs, ones that can be trained over time based on CA error corrections to captions generated by the ASR and ones that train automatically where training is not based on CA error corrections. In at least some systems cloud based ASRs (e.g., ASRs typically operated by fourth parties, the first through third parties being the AU, HU and relay) have no mechanism for consuming CA caption error corrections and therefore cannot train based off CA error corrections while ASRs that are hosted at a relay are typically trainable via CA error corrections. Given this reality, why is it advantageous to use cloud based ASRs for captioning services? The simple answer is that cloud based ASRs tend to be far more accurate than trainable but untrained relay hosted ASRs. At the beginning of most AU-HU calls, an ASR is not trained and therefore the cloud based ASRs are more accurate than the untrained relay hosted ASRs. One issue with cloud based ASRs is that the captioning service is typically more expensive to provide than relay hosted ASRs.

In at least some cases it is contemplated that a relay may employ both a cloud based ASR and a relay hosted and trainable (e.g., based on CA error corrections) ASR to provide automated captions to a CA and an AU where a processor selects one or the other of the ASRs based on detected accuracy. In a particularly advantageous system, at the beginning of a captioning session, an HU voice signal is presented to each of a cloud based first ASR and a relay hosted and trainable second ASR to generate first and second ASR caption streams, respectively. At least initially, because the cloud based ASR is almost always more accurate than a relay hosted and trainable (but initially untrained) ASR, the cloud based ASR captions are presented to the CA for error correction and immediately transmitted to the AU captioned device to be presented to the AU as an initial HU voice signal caption stream.

A first accuracy metric is generated by comparing the cloud based ASR captions to the CA error corrected captions. Similarly, the hosted ASR captions are compared to the CA error corrected captions to generate a second dynamic accuracy metric for the hosted ASR. In addition, the CA error corrections are used to train the hosted ASR so that the hosted ASR accuracy increases over time and, in particular, during a first part of an ongoing call.
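One simple way such a dynamic accuracy metric could be computed, offered only as an illustrative sketch, is a word level comparison of each ASR stream against the CA corrected stream; the edit-distance approach and the function name below are assumptions, not the claimed method.

def word_accuracy(asr_words, corrected_words):
    # Returns 1 - (word level edit distance / corrected word count), so a
    # perfect match scores 1.0 and every substitution, insertion or deletion
    # relative to the CA corrected captions lowers the score.
    m, n = len(asr_words), len(corrected_words)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if asr_words[i - 1] == corrected_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 1.0 - d[m][n] / max(n, 1)

For instance, word_accuracy("meet me at pals".split(), "meet me at pete's".split()) returns 0.75, and the same routine can score both the cloud based and the relay hosted caption streams against the same CA corrected captions.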

A relay processor compares the first accuracy metric (e.g., the cloud based ASR metric) to the second accuracy metric (e.g., the relay hosted ASR metric) and, once the second accuracy metric is better than the first, at a minimum, the relay switches over to the hosted ASR captions and provides those captions to the AU and the CA instead of the cloud based ASR captions. In addition, in at least some embodiments, the relay may disconnect and disable the cloud based ASR to avoid incurring unnecessary costs associated therewith.

An exemplary process 1950 that is consistent with at least some aspects of the present disclosure for running parallel cloud based and relay hosted ASRs is illustrated in FIG. 60. As shown, the process 1950 includes a first cloud based ASR1 sub-process 1956 where ASR1 does not learn from CA caption error corrections and a second ASR2 sub-process 1958 that is relay hosted and trains based on CA error corrections during an ongoing call. The two ASR processes ASR1 and ASR2 proceed in parallel at the beginning of a call and the ASR1 process is eventually disabled and cut out of the process once second ASR2 accuracy exceeds ASR1 process accuracy. To be clear, ASR1 may train during a call, but would only train based on automatically identified captioning errors, typically based on automatic contextual caption corrections, instead of training based on CA error corrections.

Referring again to FIG. 60, at the beginning of a captioning session during an HU-AU call, at process block 1952 a current ASR ASR_(current) is set equal to ASR1 (e.g., the cloud based ASR that does not train off CA error corrections). At decision block 1954, a system processor determines if the first ASR1 has been disabled. In the illustrated process, once second ASR2 accuracy exceeds first ASR1 accuracy, the first ASR1 is disabled (see blocks 1976 and 1978 in FIG. 60). At least initially ASR1 will not be disabled so control passes down to the ASR1 and ASR2 sub-processes 1956 and 1958.

At process block 1960, the HU voice signal is provided to the cloud based ASR1 which generates ASR1 captions. Once a CA corrects a current ASR caption stream to generate a CA corrected caption stream (see blocks 1984, 1986 in FIG. 60), the CA corrected stream is received at block 1962 and compared to the ASR1 captions at block 1964 to generate a dynamic ASR1 accuracy metric which is provided to process block 1974.

Referring still to FIG. 60 and again to block 1954, when ASR1 is not disabled, in addition to providing the HU voice signal to sub-process 1956, that HU voice signal is simultaneously provided to sub-process 1958 and, more specifically, process block 1968. At block 1968, the relay hosted ASR2 generates an ASR2 caption stream. At decision block 1970, a processor again checks if ASR1 has been disabled because trained ASR2 accuracy exceeds ASR1 accuracy. If ASR1 is still enabled, control passes to block 1972 where the CA corrected caption stream is received. At block 1973, the CA corrected captions are compared to the ASR2 captions to generate a dynamic ASR2 accuracy metric which is provided to process block 1974. As in the FIG. 59 process, the accuracy metrics may be calculated in many different ways.

At process block 1974, the ASR1 and ASR2 accuracy metrics are compared to identify the most accurate ASR (e.g., cloud based ASR1 or relay hosted ASR2 that trains off CA error corrections). At decision block 1976, if the relay hosted ASR2 is less accurate than the cloud based ASR1, control passes down to block 1982 where the ASR_(current) captions are transmitted to the AU captioned device for immediate display after which control passes to block 1983.

At process block 1983 a processor monitors for a CA or AU request that CA error correction persist or, in some cases, be initiated. If a CA or AU requested persistent CA error correction, control passes down to block 1984. If no CA or AU requested persistent CA error correction, control passes to decision block 1985 where accuracy of the ASR_(current) captions is compared to an accuracy threshold value Acc_(threshold) (e.g., 95% accurate). Where ASR_(current) caption accuracy is less than the threshold value, control again passes to block 1984. However, if ASR_(current) caption accuracy exceeds the threshold value, control passes to block 1987 where the CA is disconnected from the call and control loops back up to block 1954 where the process described above continues to cycle. Thus, if neither the AU nor CA associated with a call enters a command requiring persistent CA error corrections to ASR text, blocks 1983, 1985 and 1987 cause disconnection of the CA when the accuracy of the current ASR exceeds the high accuracy threshold level.

Referring yet again to block 1984, where CA error corrections persist, the ASR_(current) captions are presented along with the HU voice to the CA and at block 1986 the CA corrects any errors in the ASR_(current) captions. Corrections are transmitted to the AU captioned device for in line or other correction to the captions presented to the AU.

Referring again to FIG. 60, during at least the first few times through the process 1950, the ASR1 accuracy will likely be greater than the ASR2 accuracy but, eventually, as the ASR2 engine trains using the CA error corrections, the ASR2 accuracy will exceed ASR1 accuracy so that, at block 1976, control passes down to block 1978 where ASR1 is disabled. At block 1980, ASR_(current) is set equal to ASR2 and then control passes to block 1982 where ASR_(current) is again sent to the AU captioned device for display and the sub-process including blocks 1984 through 1988 continues as described above. After block 1988, control passes back up to block 1954 where the process described above continues to cycle.

Referring yet again to FIG. 60 and more specifically to block 1970, once ASR1 has been disabled, in at least some embodiments there is no reason to continue to track ASR accuracy metrics as it would be assumed that ASR2, which continues to train off CA error corrections, remains relatively more accurate than ASR1. For this reason, at block 1970, once ASR1 is disabled, control would route down to block 1982 where ASR captions from ASR2 are transmitted to the AU captioned device for display.

Thus, referring again to FIG. 60, once ASR2 accuracy exceeds ASR1 accuracy, in at least some embodiments of disclosed captioning systems, process 1950 is pared down so that all of sub-process 1956 as well as process and decision blocks 1972, 1973, 1974, 1976, 1978 and 1980 are effectively disabled.

In at least some cases, the system may implement a hysteretic ASR change whereby ASR2 must be more accurate for some threshold duration, number of captioned words, etc., prior to switching from ASR1 to ASR2. For instance, at block 1976, ASR2 may have to be more accurate than ASR1 for 15 consecutive seconds of HU voice prior to control passing to block 1978. In other cases, ASR2 may have to be more accurate than ASR1 for a duration corresponding with 50 words uttered by an HU. In still other cases, ASR2 may have to be at least 15% more accurate than ASR1 for at least 20 consecutive seconds at block 1976 prior to control passing to block 1978.
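A minimal sketch of such a hysteretic switch follows; the tuple format, the margin parameter and the function name are illustrative assumptions rather than the disclosed implementation.

def should_switch_to_asr2(history, min_consecutive_seconds=15.0, margin=0.0):
    # history: (segment_seconds, asr1_accuracy, asr2_accuracy) tuples for
    # consecutive HU voice segments, most recent last.
    run = 0.0
    for seconds, acc1, acc2 in reversed(history):
        if acc2 > acc1 + margin:
            run += seconds
            if run >= min_consecutive_seconds:
                return True
        else:
            break  # an older segment breaks the consecutive run
    return False

With history [(5, 0.92, 0.90), (6, 0.91, 0.93), (5, 0.90, 0.94), (6, 0.89, 0.95)] the function returns True because ASR2 has been more accurate than ASR1 for the last 17 consecutive seconds of HU voice.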

While process 1950 is described above as one where a caption session starts with ASR1 and switches once to ASR2 to generate captions where ASR1 is disabled once ASR2 takes over, in at least some cases it is contemplated that the ASR2 sub-process 1958 (see again FIG. 60) may persist for the entire length of a captioning session so that an ASR2 accuracy metric is persistently calculated and, when the ASR2 accuracy metric value drops below some threshold value, the system may automatically reinitiate a cloud based ASR session. Here, during the reinitiated ASR session, the ASR1 process would again generate ASR1 captions and an accuracy metric to compare to the ASR2 accuracy metric. In this case, if the ASR1 accuracy metric again exceeds the ASR2 accuracy metric, the system would reset ASR_(current) to ASR1 and the process described above would continue to cycle. Thus, here, the system may switch back and forth between ASR1 and ASR2 based on calculated accuracy metrics but, advantageously, the cloud based ASR1 would be taken out of the process at least some of the time which would reduce captioning costs substantially.

As in cases where a CA has the ability to manually switch between (i) CA generated and corrected text and (ii) ASR text correction per CA preferences, in at least some cases a CA will be able to manually switch between two or more ASRs based on CA preference, instantaneous perception related to which ASR will be most accurate at a specific time, etc., or between two or more ASRs as well as a CA caption/error correction mode. In addition, in some cases, the system will provide coaching to a CA suggesting changes to captioning protocol (e.g., ASR1, ASR2, CA, etc.) based on accuracy or other metrics.

While ASR1 in process 1950 of FIG. 60 is described as being a cloud based ASR, in other embodiments ASR1 may also be a highly accurate relay hosted ASR, albeit one that does not train using CA error corrections. Here, as in the case of the cloud based ASR system, ASR1 may be a more expensive captioning option than ASR2 and, in that case, the system may automatically switch over to ASR2 when ASR2 accuracy exceeds ASR1 accuracy so that ASR1 can be disabled.

Optimized ASR Selection Prior to Captioning

In cases where a relay selects from among several ASRs based on call characteristics such as voice type (e.g., pitch, tone, volume, etc.), voicing speed (e.g., words per minute), accent, line noise, line sound quality, etc., in at least some embodiments it is contemplated that any call involving an AU may be linked immediately to a relay irrespective of whether or not captioning is to commence immediately and an HU voice may be transmitted to the relay even prior to a captioning request. In this case, a relay processor may be programmed to analyze the HU voice signal to identify call characteristics and may then select an optimal ASR for captioning the call if and when captioning is required. Thus, for instance, a first ASR1 may be better suited to accurately caption when line noise is substantial and a second ASR2 may be better suited when line noise is below some threshold level. Here, ASR1 or ASR2 would be preselected prior to initiation of a captioning session so that the optimal ASR can start automatic captioning once captioning is required or requested. Many other call characteristics are contemplated that could be identified prior to captioning and used to select an optimal ASR.
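Purely as an illustration of this kind of pre-captioning selection, the sketch below maps a few pre-call measurements onto engine names; the profile keys, thresholds and ASR labels are hypothetical.

def preselect_asr(call_profile):
    # call_profile: measurements taken before captioning is requested, e.g.
    # {"noise_db": -35.0, "words_per_minute": 180}.
    if call_profile.get("noise_db", -60.0) > -40.0:
        return "ASR1"  # engine assumed to handle substantial line noise better
    if call_profile.get("words_per_minute", 0) > 200:
        return "ASR3"  # engine assumed to handle fast speech better
    return "ASR2"      # default engine for clean, ordinary-pace calls

The selected engine would simply be staged so it can begin automatic captioning the moment a caption request arrives.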

In at least some embodiments, at least an initial determination of which ASR to use to handle a call may be made by an AU device. To this end, the AU device may run software that listens to an HU voice, identifies characteristics of that voice signal and that then uses those characteristics to identify one of several ASRs optimized to generate most accurate captions. In this case, the AU device may transmit an optimized ASR control signal to a relay or third party ASR provider before or after an AU generates a caption request and the relay or provider would then use the ASR associated with the optimized control signal to at least initiate ASR captioning if and when captioning is required.

In still other cases, an HU communication device may also be programmed to listen to an HU voice signal, identify voice characteristics and use the identified characteristics to identify one of several ASRs optimized to generate most accurate captions given the voice characteristics. Again, the HU device would transmit the optimized ASR control signal to a relay or third party ASR provider before or after a captioning request and the relay or provider would use the optimized ASR to at least initially caption the HU voice signal once captioning is required.

In at least some cases, where an ASR can train without CA error corrections, ASR training for a specific call may occur prior to a captioning request. To this end, again, at the beginning of an HU-AU call and prior to the AU requesting captioning, the HU voice signal may be provided to a relay or a third party ASR provider. Here, a relay or provider processor may use an ASR to caption the HU voice signal and may attempt to identify caption errors based on content within the generated captions. Caption errors can then be used to better train the ASR so that, subsequently when captioning is requested, the trained ASR can be used immediately to generate relatively more accurate captions.

HU Voice Signal Conditioning

In at least some embodiments a relay may be programmed to condition HU voice signals received from other system components (e.g., an AU captioned device, an HU phone device, etc.) to optimize those signals for other purposes. For instance, a simple example of an HU voice characteristic that may be adjusted by a relay processor to optimize for ASR captioning and broadcast to a CA is HU voice signal volume, where a processor may adjust volume to be substantially continuous and identical for each of an ASR and a CA or continuous and at different levels for an ASR and a CA. In other cases, the HU voice signal volume may be adjusted to be substantially continuous for a CA but may be fed to an ASR at whatever volume the HU generated the signal. Another voice characteristic that may be adjusted is speaking pace. For instance, in some cases an HU may alternate from speaking quickly to slowly. In this case, a relay processor may adjust speaking pace in the HU signal broadcast to a CA so that the overall pace is constant. Here, in at least some cases where an ASR operates to generate HU voice signal captions, the ASR will often outpace a CA in captioning. In this case, the ASR results can be examined by the processor for pace so that when the voice signal is subsequently broadcast to the CA, the voice signal pace can be rendered substantially constant.
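For the volume portion of such conditioning, one conventional approach is per-frame RMS normalization, sketched below under the assumption that audio arrives as floating point sample frames; the target level is arbitrary and numpy is used only for the arithmetic.

import numpy as np

def normalize_volume(frame, target_rms=0.1):
    # frame: one buffer of HU voice samples in the range [-1, 1].
    samples = np.asarray(frame, dtype=float)
    rms = float(np.sqrt(np.mean(np.square(samples))))
    if rms < 1e-6:
        return samples  # treat near-silence as silence and leave it alone
    # Scale toward the target loudness, clipping to avoid overflow.
    return np.clip(samples * (target_rms / rms), -1.0, 1.0)

Each frame broadcast to the CA could pass through such a stage so the CA hears a substantially continuous volume, while the raw frames could still be fed to an ASR unchanged.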

In some cases different CAs may prefer different HU voice signal paces and in those cases it is contemplated that a CA may be able to set HU voice signal pace or that a relay processor may be programmed to "hunt" for a CA optimal pace and automatically adjust pace in real time for CA optimization. Thus, the processor may adjust HU voice signal pace and monitor CA accuracy and speed so that the pace can be modified until optimized. Other voice characteristics may be optimized as well for CA broadcast and consumption by one or more different ASRs.

In at least some cases one or more biometric sensors may be included within an AU's caption device that can be used for various purposes. For instance, see again FIG. 1 where a camera 75 is included in device 12 for obtaining images of an AU using the caption device 12 during a voice communication with an HU. Other biometric sensor devices are contemplated such as, for instance, the microphone in handset 22, a fingerprint reader 23 on device 12 or handset 22, etc., each of which may be used to confirm AU user identity.

One purpose for camera 75 or another biometric sensor device may be to recognize a specific AU and only allow the captioning service to be used by a certified hearing impaired AU. Thus, for instance, a software application run by a processor in device 12 or that is run by the system server 30 may perform a face or eye recognition process each time device 12 is activated, each time any person locates within the field of view of camera 75, each time the camera senses movement within its FOV, etc. In this case it is contemplated that any AU that is hearing impaired would have to pre-register with the system where the system is initially enabled by scanning the AU's face to generate a face recognition model which would be stored for subsequent device enablement processes.

In other cases it is contemplated that hearing specialists or physicians may, upon diagnosing an AU with sufficient hearing deficiency to warrant the captioning service, obtain an image of the AU's face or an entire 3D facial model using a smart phone or the like which is uploaded to a system server 30 and stored with user identification information to facilitate subsequent facial recognition processes as contemplated here. In this way, AUs that are not comfortable with computers or technology may be spared the burden of commissioning their caption devices at home which, for some, may not be intuitive.

After a caption device is set up and commissioned, once an authorized AU is detected in the camera FOV, device 12 may operate in any of the ways described above or hereafter to facilitate captioned or non-captioned calls for an AU. Where a person not authorized to use the caption service uses device 12 to make a call, device 12 may simply not provide any caption related features via the graphical display screen so that device 12 operates like a normal display based phone device.

In other cases images or video from camera 75 may be provided to an HU or even a CA to give either or both of those people a visual representation of the AU so that each can get a sense from non-verbal cues of the effectiveness of AU communications. When a visual representation of the AU is presented to either or both of the HU and CA, some clear indicator of the visual representation will be given to the AU such as, for instance, a warning message on display 18 of device 12. In fact, prior to presenting AU images or video to others, device 12 may seek AU authorization in a clear fashion so that the AU is not caught off guard.

In at least some embodiments described above, ASR or other currently best caption text (e.g., CA generated text in a full CA mode of operation) is presented immediately or at least substantially immediately to an AU upon generation and subsequently, when an error in that initial text is corrected, the error is corrected within the text presented to the AU by replacing the initial erroneous text with corrected text. To notify the AU that the text has been modified, the corrected text is highlighted or otherwise visually distinguished in line. It has been recognized that while highlighting or other tagging to distinguish corrected text is useful in most cases, those highlights or tags can become distracting under certain circumstances. For instance, when substantial or frequent error corrections are made, the new text highlighting can be distracting to an AU participating in a call.

In some cases, as described above, a system processor may be programmed to determine if error corrections result in a change in meaning in an including sentence and may only highlight error corrections that are meaningful (e.g., change the meaning of the included sentence). Here, all error corrections would be made on the AU device display but only meaningful error corrections would be highlighted.

In other cases it is contemplated that all error corrections may be visually distinguished where meaningful corrections are distinguished in one fashion and minor (e.g., not changing meaning of the including sentence) error corrections are distinguished in a relatively less noticeable fashion. For instance, minor error corrections may be indicated via italicizing text swapped into original text while meaningful corrections are indicated via yellow or green or some other type of highlighting.

In still other cases all error corrections may be distinguished initially upon being made but the highlighting or other distinguishing effect may be modified based on some factor such as time, number of words captioned since the error was corrected, number of error corrections since the error was corrected, or some combination of these factors. For example, an error correction may initially be highlighted bright yellow and, over the next 8 seconds, the highlight may be dimmed until it is no longer visually identifiable. As another example, a first error correction may be highlighted bright yellow and that highlighting may persist until each of a second and third error correction that follows the first correction is made after which the first error correction highlighting may be completely turned off. As yet one other instance, an error correction may be initially highlighted bright yellow and bolded and, after 8 subsequent text words are generated, the highlighting may be turned off while the bold effect continues. Then, after a next two error corrections are made, the bold effect on the first error correction may be eliminated. Many other expiring error correction distinguishing effects are contemplated.
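The small class below sketches just the third example (highlighting expires after 8 subsequent words, bolding after two later corrections); the class name, counters and rendering flags are illustrative assumptions, not the disclosed design.

class CorrectionEffect:
    # Tracks how a single in-line correction should be rendered as new
    # caption words and later corrections arrive.
    def __init__(self, words_until_unhighlight=8, corrections_until_unbold=2):
        self.words_seen = 0
        self.later_corrections = 0
        self.words_until_unhighlight = words_until_unhighlight
        self.corrections_until_unbold = corrections_until_unbold

    def on_new_word(self):
        self.words_seen += 1

    def on_later_correction(self):
        self.later_corrections += 1

    def style(self):
        return {
            "highlight": self.words_seen < self.words_until_unhighlight,
            "bold": self.later_corrections < self.corrections_until_unbold,
        }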

Referring now to FIG. 56, a screen shot of an AU interface is shown that may be presented on a caption device display 18 that shows caption text that includes some errors where a first error is shown corrected at 2102 (e.g., the term "Pal's" has been corrected and replaced with "Pete's"). As illustrated the new term "Pete's" is visually distinguished in two ways including highlighting and changing the font to be bold and italic.

Referring also to FIG. 57, a screen shot similar to the FIG. 56 shot is shown, albeit where a second error (e.g., "John") has been corrected and replaced in line with the term "join" 1204. In this example, the correction distinguishing rules are that a most recent error correction is highlighted, bold and italic, a second most recent error correction is indicated only via bold and italic font (e.g., no highlighting) and that when two error corrections occur after any error correction, the earliest of those corrections is no longer highlighted (e.g., is shown as regular text). Thus, in FIG. 57, the error correction at 1202 is now distinguished by bold and italic font but is no longer highlighted and the most recent error correction at 1204 is highlighted and shown via bold and italic font.

Referring to FIG. 58, a screen shot similar to the FIG. 56 and FIG. 57 shots is shown, albeit where a third error (e.g., "rest ant") has been corrected and replaced in line with the term "restaurant" 2106. Consistent with the correction distinguishing rules described above, the most recent correction 1206 is shown highlighted, bolded and italic, the prior error correction at 1204 is shown bolded and italic and the error correction at 1202 is shown as normal text with no special effect.

In any case where a second CA is taking over primary captioning from either an ASR or a first or initial CA at a specific point in an HU voice signal, the system may automatically broadcast at least a portion of the HU voice signal that precedes the point at which the second CA is taking over captioning to the second CA to provide context for the second CA. For instance, the system may automatically broadcast 7 seconds of HU voice signal that precede the point where the second CA takes over captioning so that when the CA takes over, the CA has context in which to start captioning the first few words of the HU voice signal to be captioned by the CA. In at least some cases the system may audibly distinguish HU voice signal provided for context from HU voice signal to be captioned by the CA so that the CA has a sense of what signal to caption and which is simply presented as context. For instance, the tone or pitch or rate of broadcast or volume of the contextual HU voice signal portion may be modified to distinguish that portion of the voice signal from the signal to be captioned.
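As a sketch only, the handoff playout could be described by two windows, with the context window tagged for audibly distinct playback; the rate and gain values here are arbitrary assumptions.

def handoff_playout(handoff_time, context_seconds=7.0):
    # handoff_time: seconds into the HU voice signal where the second CA
    # takes over captioning.
    context = {
        "start": max(0.0, handoff_time - context_seconds),
        "end": handoff_time,
        "rate": 1.25,  # played slightly faster so it is recognizably context
        "gain": 0.7,   # and slightly quieter than the signal to be captioned
    }
    to_caption = {"start": handoff_time, "end": None, "rate": 1.0, "gain": 1.0}
    return context, to_caption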

Systems have been described above where ongoing calls are automatically transferred from a first CA to a second CA based on CA expertise in handling calls with specific detected characteristics. For instance, a call where an HU has a specific accent may be transferred mid-call to a CA that specializes in the detected accent, a call where a line is particularly noisy may be transferred to a CA that has scored well in terms of captioning accuracy and speed for low audio quality calls, etc.

One other call characteristic that may be detected and used to direct calls to specific CAs is call subject matter related to specific technical or business fields where specific CAs having expertise in those fields will typically have better captioning results. In these cases, in at least some embodiments, a system processor may be programmed to detect specific words or phrases that are telltale signs that call subject matter is related to a specific field or discipline handled best by specific CAs and, once that correlation is determined, an associated call may be transferred from an initial CA to a second CA that specializes in captioning that specific subject matter.
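A keyword-counting sketch of that detection and routing step is shown below; the keyword dictionary, field names and helper functions are purely illustrative assumptions.

def detect_field(caption_text, field_keywords):
    # field_keywords: e.g. {"neuroscience": {"axon", "synapse", "cortex"}}.
    words = caption_text.lower().split()
    best_field, best_hits = None, 0
    for field, keywords in field_keywords.items():
        hits = sum(1 for w in words if w in keywords)
        if hits > best_hits:
            best_field, best_hits = field, hits
    return best_field  # None if no field keywords were recognized

def pick_specialist(field, specialists_by_field, available_cas):
    # Return an available CA associated with the detected field, if any.
    for ca in specialists_by_field.get(field, []):
        if ca in available_cas:
            return ca
    return None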

In some cases an AU may work in a specific field in which the AU and many HUs that the AU converses with use complex field specific terminology. Here, a system processor may be programmed to learn over time that the AU is associated with the specific field based on conversation content (e.g., content of the HU voice signal and, in some cases, content of an AU voice signal) and, in addition to generating an utterance and text word dictionary for an AU, may automatically associate specific CAs that specialize in the field with any call involving the AU's caption device (as identified by the AU's phone number or caption device address). For instance, if an AU is a neuroscientist and routinely participates in calls with industry colleagues using complex industry terms, a system processor may recognize the terms and associate the terms and AU with an associated industry. Here, specific CAs may be associated with the neuroscience industry and the system may associate those CAs with the calling number of the AU so that going forward, all calls involving the AU are assigned to CAs specializing in the associated industry whenever one of those CAs is available. If a specialized CA is not available at the beginning of a call involving the AU, the system may initiate captioning using a first CA and then once a specialized CA becomes available, may transfer the call to the available CA to increase captioning accuracy, speed or both.

In some cases it is contemplated that an AU may specify a specific field or fields that the AU works in so that the system can associate the AU with specific CAs that specialize in captioning for that field or those fields. For instance, in the above example, a neuroscientist AU may specify neuroscience as her field during a caption device commissioning process and the system may then associate ten different CAs that specialize in calls involving terminology in the field of neuroscience with the AU's caption device. Thereafter, when the AU participates in a call and requires CA captioning, the call may be linked to one of the associated specialized CAs when one is available.

In some embodiments it is contemplated that a system may track AU interaction with her caption device and may generate CA preference data based on that interaction that can be used to select or avoid specific CAs in the future. For instance, where an AU routinely indicates that the captioning procedure handled by a specific CA should be modified, once a trend associated with the specific CA for the specific AU is identified, the system may automatically associate the CA with a list of CAs that should not be assigned to handle calls for the AU.

In some cases it is contemplated that the system may enable an AU to indicate perceived captioning quality at the end of each call or at the end of specific calls based on caption confidence factors or some other metric(s) so that the AU can directly indicate a non-preference for CAs. Similarly, an AU may be able to indicate a preference for a specific CA or that a particular caption session was exceptionally good in which case the CA may be added to a list of preferred CAs for the AU. In these cases, calls with the AU would be assigned to preferred CAs and not assigned to CAs on the non-preferred list whenever possible. Here, at the end of each of a subset of calls, an AU may be presented with touch selectable icons (e.g., "Good Captioning"; "Unsatisfactory Captioning") enabling the AU to indicate satisfaction level for captioning service related to the call and those satisfaction indications would be used to categorize CAs for the specific AU.

Sensor(s) Added to AU System

A CA workstation is described above with respect to FIGS. 54 and 55 where a camera is integrated into the CA station and a station processor uses images from the camera to track a sight trajectory for the CA so that what a CA is viewing can be used as a system input for one or several different applications or system features. In at least some embodiments it is contemplated that an AU captioned system may be similarly equipped with a camera and eye or at least face tracking software so that an AU's sight trajectory or likely sight trajectory can be used as another input for controlling caption system operations.

To this end, see FIG. 62 that includes a camera 2200 aimed toward an area adjacent a display screen 2210 in which an AU's eyes 2202 are likely to be located when the AU is viewing the display screen (e.g., within 2-3 feet of the display screen). Camera 2200 generates real time images of an AU's eye(s) 2202 which are examined by a system processor to identify the sight trajectory of the AU's eye gaze and, more specifically, what (e.g., which "object") the AU is instantaneously looking at on the display screen. For instance, in some cases, the processor will ascertain a word or phrase within a caption field 2212 instantaneously viewed by the AU within presented captioned text. In other cases, the system may ascertain that the AU is instantaneously looking directly at a telepresence type video field or window 2206 showing an HU on the other end of an ongoing call. In the FIG. 62 example, the AU is instantaneously looking at the last captioned word "not".

It has been recognized that in many cases an AU's gaze or sight trajectory can be used as a rough proxy for an AU's instantaneous understanding/confusion related to an ongoing call. To this end, in most cases when an AU uses a captioned phone device or system as described in the present disclosure, when the AU fails to hear or comprehend a segment of the HU's voice signal and therefore is instantaneously confused, the AU will immediately look to the captioned device display screen to see captions associated with the HU voice signal to clarify understanding. More specifically, in most cases, when an AU is confused, the AU will look to the most recent caption segment presented on the display screen as those captions are best aligned with the instant in time at which the AU became confused. In many cases, when an AU understands an HU voice signal, the AU will look away from the captioning display screen or captioned text so as to not be distracted by the presented text (e.g., to concentrate on the audio part of the communication as opposed to text captions, minimize eye strain, concentrate vision on some other object within the AU's vicinity, etc.). It is worth noting that in many cases, AUs are only partially hearing impaired (e.g., can hear at least somewhat) and in fact, for most of their lives, had perfectly good hearing capability and are accustomed to and even prefer consuming HU voice signals audibly, not via captions, so sight trajectories away from captions are often chosen.

Thus, in many cases, an AU's instantaneous gaze can be used as a proxy for when the AU is confused by an HU voice signal segment and when the AU understands the segment. In many cases, even when captioning is enabled, most of the time an AU simply does not view captions. For instance, an AU may prefer to look out a window adjacent her captioned device while communicating with and audibly understanding an HU. As another instance, in a case where a telepresence type HU video 2206 (see again FIG. 62) is presented adjacent text captions, an AU may more often than not view the telepresence video to make "virtual eye contact" with the HU during many call segments. Again, the AU may only look at captions when periodically confused.

In at least some embodiments, an AU device or system processor may be programmed to control the captioning service automatically so that different quality services are provided when the AU is viewing captions and when the AU is not viewing the captions. For instance, when an AU is currently looking away from a captioned device display screen (e.g., toward an area laterally adjacent the display screen), the system may facilitate a relatively inexpensive and quick captioning process such as, for instance, one where high speed ASR text is generated and presented on the display screen without CA error correction. Here, the high speed ASR text gives a sense that the captioning process is ongoing and presents essentially real time glanceable captions that are correct most of the time.

In the above example, if the AU changes sight trajectory and looks at the captioned text or screen, the new trajectory is detected and a CA may be automatically and immediately connected to the call and presented a most recent segment of ASR generated text (e.g., last 10 seconds, last 10 words, etc.) as well as HU voice signal associated with the most recent ASR generated text segment for captioning. Here, the CA corrects any perceived errors in the text and those corrections are transmitted to the AU device to immediately drive in line or other caption error corrections. In at least some cases, while the AU's sight trajectory is still aimed at the display screen and, more specifically, the caption field 2212 (see again FIG. 62), the CA remains connected to the call to facilitate ongoing error corrections.

If the AU again changes sight trajectory to look away from the display screen or caption field 2212 prior to the CA correcting any or some of the perceived errors in the most recent text segment, the CA may be disconnected from the call or CA error correction may be disabled based on the assumption that the AU's new sight trajectory is a proxy indicating that the AU understands the most recent HU voice communication (e.g., there is no need for CA error correction if the AU is satisfied with her understanding of the HU voice signal). In this example, once captions are requested, at least some captions are always presented immediately upon generation (e.g., the ASR captions) and CA error correction is only enabled when an AU's sight trajectory indicates likely confusion.

In other embodiments where a first CA generates captions and a second CA error corrects, the first CA may be persistently on a call for generating initial uncorrected captions and the second CA may only be linked to the call to error correct when the AU's sight trajectory is again aimed at the display screen captions.

While the above description of AU sight trajectory input for controlling system operation is described in the context of a processor that distinguishes between AU sight trajectory aimed at captions and away from a display screen, the system may enable and disable CA error correction of recent ASR text when an AU's sight trajectory is at the caption field and at the telepresence video 2206, respectively. Thus, when an AU is making virtual eye contact with an HU presented in field 2206 (e.g., looking at the HU image in field 2206), CA error corrections may be disabled as the AU is not viewing the captions anyway. Then, when an AU looks at the presented caption field 2212, CA error correction for the most recently presented text may be enabled, at least until the AU again changes sight trajectory and looks away from the caption field 2212.

Other system operation may be automatically controlled based on AU sight trajectory. To this end, for instance, where ASR captions are instantaneously presented to an AU and a CA error corrects ASR captions and is behind in error correcting by at least some threshold duration (e.g., 20 seconds), when the AU changes gaze from looking away from presented captions to looking at the presented captions, the CA may be automatically skipped ahead to the most recently presented captions in order to correct captions most commensurate in time with the instant that the AU's gaze indicates possible confusion. Here, in at least some cases, intervening caption errors between the point the CA was correcting at and the recent captions may simply be ignored and not corrected. In other cases, intervening errors may be corrected by a second temporary CA or by an attending CA at a later time (e.g., after a call ends, during a silent duration of an on-going call, etc.).

In at least some cases the camera and processor that assess AU sight trajectory only need to be able to assess sight trajectory very granularly as opposed to precisely. In this regard, the system may only need to assess two states, one in which an AU's sight trajectory subtends the caption field 2212 on a screen and all other trajectories. Thus, here, any time an AU's sight trajectory subtends any location within caption field 2212, it may be assumed that the AU is audibly confused so that high quality CA corrected captions are required and, any time an AU's sight trajectory is aimed outside the caption field 2212, it may be assumed that the AU is not audibly confused so that lower quality ASR captions are optimal. Thus, in at least some embodiments the camera and processor are programmed to recognize gaze at the captioned text field 2212 and gaze along any other trajectory.
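Because only two states matter, the gaze test can be as coarse as a point-in-rectangle check; the sketch below assumes the eye tracker reports an estimated gaze point in screen pixels, and the names are illustrative.

def gaze_state(gaze_x, gaze_y, caption_field_rect):
    # caption_field_rect: (left, top, right, bottom) of the caption field in pixels.
    left, top, right, bottom = caption_field_rect
    inside = left <= gaze_x <= right and top <= gaze_y <= bottom
    return "captions" if inside else "elsewhere"

CA error correction would be enabled whenever the returned state is "captions" and disabled for every other trajectory.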

In other cases more precise AU sight trajectory may be required such as, for instance, a level of precision such that the processor can calculate which word in captioned text presented is being focused on. For instance, in some cases it may be assumed that the last word an AU focuses on within presented text prior to looking away is the point at which the AU's understanding can be assumed. For instance, in FIG. 62, if an AU last gazed at the phrase "Walnut Avenue", the processor may be programmed to move the CA error corrections up to the point in the text just after the "Walnut Avenue" phrase.

In some cases the system may automatically change appearance of objects on the AU captioned device display screen based on AU sight trajectory. For instance, when an AU's sight trajectory switches from telepresence video field 2206 to caption field 2212, the caption field and related text may be enlarged and the telepresence field 2206 may be shrunk to a smaller size to accommodate larger captions. When the AU again looks toward telepresence field 2206, caption field 2212 size may again be reduced and field 2206 size may be increased. As another instance, when an AU views telepresence field 2206 (e.g., sight trajectory is aimed at that field), field 2206 may be bright and caption field 2212 may be dimmed and when the AU's sight trajectory is altered to aim at caption field 2212, that field may be bright while the telepresence field 2206 is dimmed. Other visual characteristics of different fields may also be modified based on AU sight trajectory and in some cases combinations of characteristics may be modified.

While sight trajectory is often a good proxy for AU confusion/understanding state, other AU activities may also be used in a similar fashion. For instance, orientation of an AU's head and more specifically face may be a good proxy for sight trajectory and therefore the AU's confusion/understanding. Thus, where an AU's face is oriented to face the captioned device display screen, the processor may be programmed to link a CA to an ongoing call for error correction and when the AU's face is not oriented to face the captioned device screen, the processor may be programmed to disconnect the CA from an ongoing call as error correction would not be required.

Here, the idea is that in many cases CA error correction is not needed most of the time and therefore, N CAs should be able to provide captioning services when required for more than N simultaneous calls and thus the cost to provide caption services should be able to be reduced. For instance, in a simple case where ten simultaneous ongoing calls occur and each AU views captions during only 10% of each call, three or four CAs should be able to provide error corrections for all of the calls assuming that at least some of the time three or four AUs will view captions simultaneously.

In at least some cases an AU system or device processor may be programmed to monitor AU sight trajectory over time and, if the AU routinely views captions while using the captioned device, may keep a CA connected to a call persistently even when the AU periodically looks away from the display screen. For instance, if an AU's sight trajectory is aimed at the caption text field 2212 four or more times in a minute, an error correcting CA may remain linked to a call thereafter or until some other threshold of time elapses without the AU looking at the caption field (e.g., the AU looks away from the field for at least one minute). AU sight trajectory tracking over time may be during a single call or between calls so that, if a specific AU routinely looks at the caption field many times during a call, a CA may always be assigned to that AU's calls and persistently provide error correction when needed.

In still other cases, whether or not a CA is connected to a call to correct errors in recent captions when an AU's sight trajectory is aimed at captions or a captioned device display may depend on confidence factors associated with the recent captions. For instance, in a case where an ASR assigns confidence factors to ASR captioned words or phrases, if a high confidence factor is assigned to the most recent ASR caption phrase presented on a captioned device display when an AU looks at the phrase, the system processor may forego linking an error correcting CA to the call as the captions presented would highly likely be accurate. In this same case if the confidence factor assigned to the most recent ASR generated phrase is low, the processor may automatically link an error correcting CA to a call when an AU's sight trajectory is at the display screen or caption field. This feature should further reduce the number of error correcting CAs required to handle a plurality of simultaneous calls.

FIG. 63 includes a flow chart illustrating a process 2300 for using AU sight trajectory to control CA error correction functions that is consistent with at least some aspects of the present disclosure. At process block 2290 an HU-AU call is initiated and at block 2292 an error correcting CA is linked to the call and ASR text and associated HU voice signal are presented to the CA for error correction. At block 2302, images from a captioned device camera 2200 (FIG. 62) are examined by a processor to identify AU eye or sight trajectory. At decision block 2304 the processor determines if the AU's sight trajectory is frequently directed at the caption field 2212 (see again FIG. 62). Here, what constitutes "frequent" is a matter of designer choice and sets a threshold for when CA error correction will persistently be enabled because an AU routinely views caption text. For instance, again, one threshold may be if an AU views captions four or more times within a minute, another may be that an AU views captions more than 40% of the time during a recent call segment (e.g., 60 seconds), etc.

In FIG. 63, if an AU frequently views captions at block 2304, control passes down to block 2310 where the link to the CA is maintained and CA error correction persists. At block 2304, if the AU does not frequently view the captions, control passes to block 2306 where a processor determines if the AU is instantaneously looking at the captions on the device display. If the AU is not currently looking at the captions on the display, control passes to block 2314 where the CA is disconnected from the call and then control passes back up to block 2302 where the process continues to cycle. Thus, here, cost associated with CA error correction is avoided for at least a portion of a call.

Referring still to FIG. 63, at block 2306 if the AU's instantaneous sight trajectory is at the displayed captions, control passes to decision block 2308 where the processor next determines if ASR confidence factors for the most recent caption phrase or phrases are high (e.g., above some threshold) or low. Where the confidence factor(s) is high, control passes to block 2314 where, again, the CA is disconnected from the call to save costs associated with CA error correction. Where the confidence factor(s) is low, control passes from block 2308 to block 2310 where the link to the error correcting CA is either maintained or a new link to an error correcting CA is established and ASR text and associated HU voice signal for recent ASR phrases is presented to the linked CA for error correction. After block 2310 control passes to block 2312 where CA error corrections are transmitted to the AU captioned device to drive in line or other caption error correction after which control passes back up to block 2302 and the process described above continues to cycle.
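The branch structure of FIG. 63 can be summarized in a few lines; the sketch below only mirrors the decisions at blocks 2304, 2306 and 2308, and the function name and confidence threshold are assumptions.

def ca_link_decision(frequent_viewer, looking_at_captions, recent_confidence,
                     confidence_threshold=0.9):
    if frequent_viewer:
        return "keep CA linked"   # block 2310: persistent error correction
    if not looking_at_captions:
        return "disconnect CA"    # block 2314: no one is reading the captions
    if recent_confidence >= confidence_threshold:
        return "disconnect CA"    # block 2314: recent ASR captions likely accurate
    return "keep CA linked"       # block 2310: low confidence, CA corrects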

FIG. 63A includes a flowchart that illustrates another captioning process 2287 where CA error correction only occurs for low confidence ASR text and only when an AU sight trajectory is directed at ASR captions and where alternate CAs handle consecutive low confidence ASR text segment corrections. At block 2291 an HU-AU call commences and at decision block 2293 a processor determines if an AU has requested captioning or if captioning is set as a default. Where captioning is not on, control loops back through block 2293 where the system waits for caption initiation. Once captioning is initiated control passes to block 2295.

Referring still to FIG. 63A, at block 2295, a system device links to the relay and the HU voice signal is provided to the relay. At block 2291 an ASR generates captions that are transmitted to the AU device to be presented immediately to the AU for consideration. At block 2293 the ASR or some other system processor identifies any low confidence factor ASR captioned text. When a low confidence factor text word or phrase is identified, at 2295 that word or phrase along with surrounding text or phrases for context is sent to a linked CA for broadcast, viewing, consideration and, when needed, error correction by the CA. Here, the CA is linked to the call and receives only one or a small number of caption segments, each including a single low confidence word or phrase for error correction or affirmation (e.g., no correction if accurate). For instance, in at least some cases low confidence text segments will be presented to the first CA until the first CA error correction delay exceeds some threshold duration of HU voice signal (e.g., 10 seconds) (see block 2295).

In other cases the first CA may be provided a predefined small number of low confidence factor texts for consideration (e.g., 2, 5, 10, etc.). In still other cases the first CA may handle any low confidence factor text corrections that occur during a 40 second segment of the call. Once the low confidence caption segment(s) is corrected or affirmed, the CA is delinked from that call and is available for other calls.

When a next low confidence caption text which follows the texts considered by the first CA is identified, a link to a second CA is established and that next low confidence text along with surrounding words or phrases for context is presented to the second linked CA for broadcast, viewing, consideration and, when needed, error correction by the second CA. Again, the second CA is only linked to the call to handle a subset of low confidence text error correction tasks and is then delinked from the call and is available to handle error corrections on a different call. This process of consecutively linking to different CAs to handle sequential low confidence factor text consideration continues until the call ends or some other event causes CA error corrections to cease (e.g., ASR accuracy exceeds some required threshold so CAs are delinked generally, the AU looks away from the captions so there is no need for CA level accuracy, etc.). Between CA linkages to a call, while no CA may be linked to the call, in at least some embodiments the communication line or link to the relay remains intact so that relinking to a CA when next needed can be expedited.
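One way to express the rotating assignment of low confidence segments to different CAs, again only as an illustrative sketch with hypothetical names:

from itertools import cycle

def route_low_confidence_segments(segments, ca_pool, confidence_threshold=0.8):
    # segments: (text, confidence) pairs in call order; ca_pool: CA identifiers.
    # High confidence segments are never sent to any CA, and consecutive low
    # confidence segments go to different CAs so no CA hears much of the call.
    next_ca = cycle(ca_pool)
    for text, confidence in segments:
        if confidence < confidence_threshold:
            yield next(next_ca), text

For example, with two CAs in the pool, the first low confidence segment goes to the first CA, the next to the second CA, the next back to the first CA, and so on.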

Referring still to FIG. 63A, at block 2297 a system processor (e.g., AU captioned device processor) tracks AU sight trajectory and at decision block 2299 the processor determines if the AU is looking at the captions. Where the AU is not looking at the captions, control passes to block 2309 where any CA that is linked to the call to receive HU voice signal is delinked from the call; again, the call-relay link or communication line is maintained during the duration of the call in at least some embodiments. After block 2309 control passes back up to block 2297 where the process described above and hereafter continues to loop.

Referring again to FIG. 63A, at block 2299, where the AU is looking at the captions on the captioned device display, control passes to block 2301 where the processor determines if recent caption phrases (e.g., in the last 10 seconds or including phrases currently presented on the AU captioned device display) are high confidence or low confidence. Where the most recent captions are all high confidence, control passes again to block 2309 where the processor delinks any linked CAs while maintaining the relay connection so a new CA linkage can be established quickly when needed.

At block 2301, when a low confidence factor is associated with at least one of the most recent ASR caption phrases, control passes to block 2303 where an existing CA link is maintained or, if there is no existing CA link, a new CA link is established and the low confidence factor ASR text and associated HU voice signal is presented to the linked CA for correction consideration. CA error corrections are received at block 2305 and are transmitted to the AU captioned device for in line or other correction. After block 2305 control loops back up to block 2297 where the process above continues to loop.

Other rules for caption system control based on AU sight trajectory and other sensed AU factors are contemplated.

AU sight trajectory can be used to optimize system operation in other ways. For instance, where an AU looks away from captions presented on a captioned device display for some time (e.g., at least a threshold duration such as 20 seconds), if CA error corrections are behind the HU voice signal by some duration, the CA may be automatically moved ahead in the HU voice signal to reduce error correction latency. For example, assume a CA is 30 seconds behind on error correcting an HU voice signal and that for the last 20 seconds, the AU has been looking away from the captions presented on her captioned device display. In this case, again, the AU's sight trajectory away from the captions is often a good proxy indicating that the AU understands the HU voice signal recently heard (e.g., during the last 20 seconds). In this case, the system may skip the CA ahead by 20 seconds within the HU voice signal and ASR captions so that the CA immediately error corrects more recent ASR captions that are better aligned with AU confusion that may occur next. Here, the benefit is that ASR captions are corrected that are most likely to be associated with audio that causes AU confusion. Again, AUs typically refer to captions when confused and therefore the most recent captions are typically associated with AU confusion and accuracy of those captions is more important than accuracy of captions that the AU does not view during a call.
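A sketch of the skip-ahead computation, with illustrative names and the 20 second threshold from the example above:

def skip_ahead_point(ca_position, asr_position, look_away_seconds,
                     look_away_threshold=20.0):
    # ca_position/asr_position: seconds into the HU voice signal for the CA's
    # correction point and the most recent ASR caption, respectively.
    if look_away_seconds < look_away_threshold:
        return ca_position  # AU may still want corrections for this material
    # Jump the CA forward by the time the AU spent looking away, never past
    # the most recent ASR caption.
    return min(asr_position, ca_position + look_away_seconds)

In the example above, skip_ahead_point(70.0, 100.0, 20.0) returns 90.0, moving the CA from 30 seconds behind to 10 seconds behind the current HU voice signal.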

Automatically Adjusting Captioning System

The captioning systems disclosed above have many different operating parameters and characteristics such as, for instance, more or less ASR captioning and error correction, more or less CA captioning and error correction, characteristics related to when ASRs are used for which calls as well as which ASRs are used for different parts of calls, when ASRs are selected, when line connections are made, how text is presented to CAs, AUs and, in some cases, HUs, how and when text errors are corrected and indicated, etc. While rules governing captioning characteristics may be programmed and implemented automatically or, in some cases, at the request of an AU, a CA or an HU, in some cases it is contemplated that the system may be programmed to learn user preferences or tendencies so that the system can automatically adjust and optimize operation for specific users. For instance, where an AU routinely firms up caption text presented on her captioned device display screen (e.g., see icon 221 in FIG. 17) prior to CA error correction of the most recent presented text (e.g., during an ongoing call or during a string of several calls), the system may automatically adjust operation to remove the CA from the captioning process at a lower accuracy threshold than a default initial threshold so that a full ASR system becomes operational more rapidly. Here, the idea is that if the AU essentially never or only minimally benefits from CA error corrections because of how the AU uses the system, the system adjusts to take out the "expensive" CA service more rapidly to reduce overall cost.

As another instance, where an AU routinely requires text catch up so that CA error corrections are always within the most recent threshold duration of an HU voice signal (e.g., the last 15 seconds), the system may automatically adjust operation so that a CA cannot fall behind a current HU voice signal by more than 15 seconds. Here, the adjustment may manifest itself in a CA interface where ASR text corresponding to HU voice signal prior to the most recent 15 seconds is firmed up and cannot be corrected or otherwise changed by the CA, which ensures that the CA is always correcting errors that are commensurate with the text that the AU most cares about. In still other cases, other sensed AU activities may be used to automatically adjust system operation.
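
One way to express this firming rule is to lock any caption segment whose audio is older than the catch-up window. The sketch below assumes caption segments carry an end-of-audio timestamp and an editable flag; the field names and the 15 second window are illustrative assumptions.

    # Illustrative sketch only; segment fields and the window size are assumptions.
    CATCH_UP_WINDOW = 15.0  # seconds of most recent HU voice the CA may still correct

    def firm_up_old_segments(segments, current_time):
        """Mark caption segments outside the catch-up window as uneditable."""
        for segment in segments:
            if current_time - segment["audio_end"] > CATCH_UP_WINDOW:
                segment["editable"] = False  # CA interface refuses corrections here
        return segments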

Other Text Firming Rules

Several different rules for firming up text or errors generated by an ASR or a CA have been described above where one or the other or both of an ASR and a CA are prohibited at some point from further caption error corrections. In other embodiments it is contemplated that there may be no rules for firming up text so that either of a CA or an ASR that generates a most recent caption error correction can drive error corrections on a CA display screen or in captions presented to an AU. In other cases there may be other tie breaker rules such as, for instance, if a second error correction occurs even a split second after a first, the second error correction is implemented or, in the alternative, the first error correction is implemented, and the non-implemented correction is discarded. In still other cases, any CA error correction to a specific word or phrase may be treated as truth and firm up the corrected text while all other text that is not CA corrected may still be fair game for ASR error corrections. Other rules for resolving conflicting ASR and CA error corrections are contemplated.

AU Split Screen Viewing

It has been recognized that in at least some cases an AU may want to have real time HU voice signal captioning while also having the ability to simultaneously view prior call captions during an ongoing call. For instance, during a long call, an AU may be interested in reviewing what an HU said several minutes ago. In some cases the AU may be able to simply scroll up on captions presented on a captioned device display screen to see prior captions. In a particularly advantageous case, when an AU scrolls up so that real time captions would no longer fit on a display screen if all intervening captions were also presented, a processor driving the captioned device display may be programmed to automatically split the display screen so that two caption sets, an archived set from some time back and a real time set, are presented simultaneously without presenting intervening captions. To this end, see for instance FIG. 64 where an AU interface 2400 is shown that includes a split dual caption screen including an upper screen showing prior captions from earlier during an ongoing call and real time instantaneous captions. On screen selectable arrow icons 2458 are provided in the prior captions field that are selectable to move back and forth in time to different prior caption segments.

Dual HU Video

Verbal communication is only one way that people express themselves. Other ways are through gestures, posture, and facial expressions. Telepresence type video enhancements (see 1412 in FIG. 44) can be added to some captioned device interfaces to enrich HU-AU communications where an AU has the ability to see an HU in real time and, in at least some cases, during playback of an HU voice segment. One issue with telepresence type videos is that they typically only give an AU a single HU view and therefore cannot provide a complete sense of non-verbal expression. To this end, a telepresence video is typically obtained from a vantage point where just the upper torso of an HU can be viewed where posture and gesture related communication can be gleaned but where facial expressions may be more difficult to discern. Here, in some cases a system user observing a telepresence video may be able to adjust zoom to zoom in on an HU's face to see facial expressions but then the observer cannot see posture and gestures.

In at least some embodiments it is contemplated that an AU interface may present more than one simultaneous telepresence type video to an AU to increase the amount of non-verbal communication cues that an AU can pick up on during HU communication. Here, a system processor may generate two or more HU views using images/video captured by a single HU device camera. For instance, the system processor may be located at the AU captioned device in some embodiments. In other embodiments, the system processing required for generating two or more HU videos may be at the HU device. In still other cases the system processor may be a relay processor.

In at least some cases a first torso type telepresence video may be generated and a second facial type telepresence video may be generated using images from one or more HU device cameras and both videos may be presented to the AU simultaneously via a single interface. In this regard, see FIG. 65 that includes captioned text in field 2500, a torso video in field 2502 and a facial video in field 2504. As shown, the facial video 2504 is positioned just under the captioned device camera 2200 so that while the AU is looking at the facial video 2504, camera 2200 can obtain an AU video optimal for creating a sense of eye contact for the HU on the other end of a call.

Where a facial video is generated, the system processor may perform a centering function as part of the video generation process where the processor automatically centers the HU's face within the video even if the HU is moving laterally or up-down with respect to her device camera. Thus, the HU's face may remain essentially stationary within field 2504 so that her facial expressions can be easily observed without distracting movement. In at least some cases a similar centering function may be performed on the torso video representation in field 2502.

In at least some cases where an AU device tracks AU sight trajectory, the AU device processor may be programmed to move interface objects about on a display screen automatically to optimize the sense of direct eye contact with the HU as the AU looks at different objects on the interface. For instance, in FIG. 65, while the AU is looking at facial video field 2504, that field may be presented as shown just under the device camera 2200. If the AU changes sight trajectory to view caption field 2500, the FIG. 65 interface may be rearranged so that caption field 2500 is moved up to a location just below camera 2200 with facial video field 2504 moved to a different location on the interface. This process may be repeated any time the AU changes sight trajectory from one object to another so that the currently viewed object is located proximate the camera and others are moved to different locations. The rearrangement of objects on the display screen may be gradual so that the rearrangement activity does not distract the AU.

Confidence Factors for CA Generated Text

Several of the systems described above include features where confidence factors are generated for ASR engine captions. In at least some embodiments where a CA generates captions (e.g., listens to HU voice signal and types captions or revoices to voice trained software which then generates CA captions), the system may be programmed to automatically generate confidence factors for each CA generated word or phrase. For instance, in a case where a CA types captions, a system processor may run a parallel ASR engine to generate ASR captions and may generate confidence factors associated with each ASR word or phrase. The processor may compare high confidence ASR word captions to CA generated captions and, when there is a mismatch (e.g., on a scale of 1 to 10, a difference of more than 2, 3, 4, or 5), the processor may visually distinguish (e.g., highlight, underline, etc.) the CA generated word in captions that are presented to the CA for error correction. In addition, the processor may present the ASR captioned word or phrase (e.g., hovering over the possible error text in the CA generated text) for quick user selection.
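
The following is a minimal sketch of the parallel-ASR comparison, assuming the CA and ASR caption streams have already been word aligned and that ASR confidence is reported on a 1-to-10 scale as in the passage above; the alignment and the cutoff value are simplifying assumptions.

    # Illustrative sketch; assumes word-aligned CA and ASR caption streams.
    HIGH_CONFIDENCE = 8  # ASR words at or above this score are trusted for comparison

    def flag_possible_ca_errors(ca_words, asr_words, asr_confidences):
        """Return (index, suggested word) pairs for CA words to visually distinguish."""
        flagged = []
        for i, (ca_word, asr_word, conf) in enumerate(
                zip(ca_words, asr_words, asr_confidences)):
            if conf >= HIGH_CONFIDENCE and ca_word.lower() != asr_word.lower():
                # Present the ASR word as a hover suggestion next to the CA word.
                flagged.append((i, asr_word))
        return flagged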

As another instance, in a case where a CA revoices the HU voice signal to an ASR trained to the CA voice to generate CA captions, the ASR trained to the CA voice may generate confidence factors for each caption word or phrase based on how many close caption options exist for each specific word. When there are several close options, each of which makes grammatical sense, a confidence factor would be low. Other factors for assessing caption confidence factors for specific words are contemplated. Here, a system processor would visually distinguish low confidence CA generated words in the captions presented to the CA for error correction.

In cases where a CA types captions and an ASR generates ASR captions in parallel and an initial ASR caption for a word matches a CA generated caption for the same word, if the ASR generates a low confidence factor for the word, a system processor may be programmed to visually distinguish the low confidence word for the CA during error correction.

CA Involvement Based on Pool of Available CAs

In at least some cases it is contemplated that the number of CAs available and not captioning calls for a service provider will fluctuate. For instance, where a relay call center has 500 CAs working during a morning shift to handle incoming calls, at times essentially all (e.g., 90%) of the CAs may be linked to different calls and at other times it may be that only 50% of CAs are linked to handle calls. When almost all CAs are linked to calls, there is a possibility that the remainder of CAs may be required to handle additional incoming calls in the near term. In contrast, when half the available CAs are not currently linked to ongoing calls, there is less possibility that the pool of available CAs to handle additional incoming calls will be depleted. For this reason, in at least some embodiments, the system may be set up to delink CAs from calls more speedily at some times than at others based at least in part on the number of CAs available to handle additional incoming calls.

For instance, on one hand, in a case where half of all relay center CAs are not linked to ongoing calls and therefore are available to handle incoming calls, even on a call where an ASR is highly accurate, the CA may remain linked to the call to facilitate error correction as that CA likely will not be needed to handle any likely near term influx of new calls.

On the other hand, in a case where 495 out of 500 CAs are currently linked to ongoing calls so that only 5 are available to handle new calls, the system may be programmed to identify many of the 495 calls currently attended to by CAs as candidates to be switched over to full ASR captioning or to some captioning process whereby a CA is only required for a portion of call segments (e.g., where CAs are only linked to the call for short durations to only handle low confidence factor text and are delinked to handle low confidence call segments on other calls) and may then either automatically delink CAs from calls or at least portions of calls or suggest that option to attending CAs. Thus, here, of the 495 currently attending CAs, it may be that 120 can be freed up to handle additional incoming calls.

Here, it is contemplated that the system may have different threshold CA occupied levels at which different ASR accuracy is required prior to switching between different captioning processes. For instance, where less than 70% of CAs are currently handling ongoing call captioning, the ASR accuracy level required to switch from CA error correction to full ASR captioning may be high (e.g., 98%) and where 70% or more of CAs are currently handling ongoing call captioning, the ASR accuracy level required to switch from CA error correction to full ASR captioning may be relatively low (e.g., 94%). Here, there may be several CA occupied level thresholds associated with different ASR accuracy levels. In addition, there may be different thresholds for switching from full CA transcription and correction to ASR transcription with CA error correction and then from ASR transcription with CA error correction to full ASR transcription without CA error correction.
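
The occupancy-dependent switching rule can be summarized as a lookup from the fraction of occupied CAs to the ASR accuracy required before a CA is released. The two-level table in this sketch mirrors the 70%/98%/94% example above; everything else is an illustrative assumption.

    # Illustrative sketch of the occupancy-dependent switching rule.
    def required_asr_accuracy(fraction_cas_occupied):
        """ASR accuracy needed before switching a call to full ASR captioning."""
        if fraction_cas_occupied >= 0.70:
            return 0.94  # CAs are scarce: release them at a lower accuracy bar
        return 0.98      # CAs are plentiful: keep them linked unless the ASR is excellent

    def should_delink_ca(asr_accuracy, fraction_cas_occupied):
        return asr_accuracy >= required_asr_accuracy(fraction_cas_occupied)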

In at least some cases it is contemplated that one CA may be associated with two or more ongoing calls simultaneously in cases where CA error correction requirements for the two or more calls are minimal. Thus, for instance, in a case where error correction is only required 5% of the time on each of first and second calls, a single CA may be presented with HU text from each of the first and second calls for error correction. Here, text from the first and second calls may be presented in first and second side by side windows or as a single scrolling text with interleaved text segments from each of the first and second calls.

Up/Down Voice Signal Sampling

In at least some cases third party ASRs only accept audio at particularly high sample rates (e.g., 16K) while phone lines only carry lower rate signals (e.g., 8K audio maximum). Thus, in some cases a relay server receiving an 8K or lower HU voice signal from an AU captioned device or directly from an HU phone device may be programmed to automatically convert the received voice signal to a higher sampling rate like 16K prior to sending that signal via the Internet or other communication network to the third party ASR for transcription.
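
A minimal sketch of the rate conversion is shown below, assuming mono 16-bit PCM samples held in a NumPy array; a deployed system would more likely use a proper polyphase or sinc resampler, so the linear interpolation here is only illustrative.

    # Illustrative 8 kHz -> 16 kHz up sampling sketch using linear interpolation.
    import numpy as np

    def upsample_8k_to_16k(samples_8k):
        """Double the sample rate of a mono 16-bit PCM signal."""
        samples_8k = np.asarray(samples_8k, dtype=np.float64)
        n = len(samples_8k)
        old_t = np.arange(n)
        new_t = np.arange(2 * n) / 2.0  # twice as many sample instants
        samples_16k = np.interp(new_t, old_t, samples_8k)
        return samples_16k.astype(np.int16)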

In at least some cases it is contemplated that the low to high rate sampling may be performed by the AU captioned device instead of by the relay server and that the AU captioned device may send the high rate HU voice signal directly to the third party ASR instead of through the relay server. Here, the advantage is that the cost associated with higher rate signal transcription is shifted to the AU instead of being borne by the relay operator. This is important because many AUs will have an unlimited data plan and therefore high rate signals can be accommodated without additional expense. This should be contrasted with a case where a relay operator pays for data usage on a volume basis as opposed to being based on an unlimited data plan.

In cases where an AU captioned device up samples data from, for instance, an 8K voice signal to generate a 16K voice signal which is sent to a third party ASR, in at least some cases ASR transcribed text may be transmitted to the relay server as opposed to the AU captioned device. In other cases the transcribed text may be transmitted back to the AU captioned device and then on from there to the relay for error correction. Where ASR text is sent directly to the AU captioned device that text may be immediately presented to the AU.

In cases where an AU captioned device up samples the HU voice signal and sends that along to a third party ASR, the AU captioned device may also send a lower sample rate signal on to the relay for CA captioning or error correction. Thus, for instance, the HU voice signal sent to the ASR may be 16K while the signal sent to the relay for CA captioning and/or error correction may be 8K. In still other cases the AU captioned device may even down sample an HU voice signal (e.g., 4K or 2K) prior to sending it along to the relay in order to reduce relay data costs.

There are at least two advantages associated with an AU device up sampling an HU voice signal and sending that signal directly to a third party ASR instead of through a captioning relay. First, captioning latency is reduced if the voice signal is sent directly to the ASR as opposed to through the relay. Second, as indicated above, data transmission costs are shifted from the relay operator to the AU and are often covered by an unlimited data plan.

In at least some cases it is contemplated that an HU phone device that is internet capable may automatically generate and transmit a 16K (or greater as captioning requirements evolve and require higher sampling rates over time) HU voice signal directly to a third party ASR captioning service and transmit a lower sample rate HU voice signal to the AU captioned device or the relay for CA captioning and/or error correction. Here, the third party ASR may transmit captions and related data back to the HU device, to the AU captioned device and/or to the relay server. Thus, for instance, the ASR may transmit the captions to the HU device which then transmits the captions to one or each of the AU captioned device and the relay. As another instance, the ASR may transmit captions to the AU captioned device which then retransmits to the relay or to the relay which then retransmits to the AU captioned device. As still one other instance, the ASR may transmit captions to each of the AU captioned device and the relay.

Similarly, captions from an ASR may be passed directly to an AU's captioned device and from there on to a relay for error correction.

Other Concepts

In cases where CA captioning delay or error correction lag time is presented to a CA or an AU (see FIGS. 17, 45, etc.), the lag time may be adjusted downward based on any silent periods within an HU voice signal. For instance, where a CA is captioning HU voice signal that is 25 seconds behind a current instant in time and there is a 15 second HU voice signal silent period within the 25 second period, the delay may be represented as a 10 second delay. In other cases, a delay may be adjusted downward based on other factors. For instance, where a CA is correcting errors in ASR generated text and all ASR generated text between the text currently considered (e.g., listened to) by the CA and current text is high confidence factor text, the system may be programmed to play back the intervening HU voice signal at double the normal rate so that the time required to catch up can be reduced and the delay indicated can be cut back. In still other cases, an anticipated captioning or error correction delay may be increased beyond the duration of HU voice signal time between text currently considered by a CA and a current time where, for instance, line noise in a signal is relatively high, a CA has had trouble quickly captioning a call, etc.
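
The lag adjustment can be sketched as subtracting detected HU silence from the raw delay, matching the 25 second / 15 second / 10 second example above; the list of silent spans is an assumed input from whatever silence detector the system uses.

    # Illustrative sketch of the displayed-delay adjustment; silent_spans is an
    # assumed input listing (start, end) offsets of detected HU silence.
    def displayed_lag(ca_position, current_time, silent_spans):
        """Raw CA lag minus any HU silence inside the lagging interval."""
        raw_lag = current_time - ca_position
        silence = sum(
            max(0.0, min(end, current_time) - max(start, ca_position))
            for start, end in silent_spans)
        return max(0.0, raw_lag - silence)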

Conference Calls

In at least some cases it is contemplated that an AU may be on a conference call with two or more HU conferees. Here, the captioning system may operate in any of the ways described above where the two or more HU voice signals are captioned and text is sent back to the AU's captioned device to be presented to the AU via a display screen. Here, one problem that can result is that an AU cannot discern which of two or more HU conferees is saying what on the call as the system presents text as if the captions are associated with a single incoming HU voice signal. Remember that an AU is at least hearing impaired and therefore may not be able to distinguish between different voices associated with different textual voice messages, which can cause confusion.

In at least some embodiments the problem of discerning which HU on a multi-HU conference call is saying what is dealt with by identifying different HU voices as they are received at an AU captioned device or at a relay and then, as text is generated for each of the voice signals, indicating which HU uttered which messages. In at least some cases where an HU uses a smart phone or other communication device that generates user identifying information, the HU device will transmit an HU identifier along with each voice signal transmitted that can be used to distinguish the HU voice signal from other HUs. In some cases the HU identifier will include a phone number or other device address, a user's name or a non-specific identifier so that the HU's identity is not determinable but the HU voice signal can be distinguished from other HU voice signals.

In some cases an AU captioned device processor, relay processor or some other system processor may be programmed to distinguish different voice signals automatically simply based on differences in voice characteristics. In hybrid cases a relay or other device may use HU identifiers to distinguish HU voice signals where those identifiers are available and, when HU identifiers are not available for one or more HU voice signals on a call, a system processor may then use different voice characteristics to distinguish other voices on a call. For instance, where there are four HUs on a conference call and two use smart phones that provide HU identifiers along with each voice message uttered while two do not, the system would use the HU identifiers to identify the two associated voice signals and would use voice characteristics of the other two HUs to distinguish each of those two other voice signals.

In still other cases smart phone or other voice capturing device processors may be programmed to code each separate HU voice signal and associated text differently so that each HU voice signal can be distinguished from all others. For instance, first and second HU voice signals may be modified so that they have first and second pitches, respectively, so that they are distinguishable by other voice receiving processors within the system. Receiving processors can reconvert the modified voice signals back to their original signals for broadcast to CAs or an AU when needed.

In cases where voice signals are distinguished, a system processor time stamps the beginning and end times of each voice signal automatically and stores the separate voice signal segments, time stamps and HU identities or identifiers for each segment. The voice segments are then converted to text and each text segment is associated with one of the time stamped voice segments and an associated HU. Next, the text captions are presented to the AU in some fashion where each text segment is presented in a way that associates the HU that uttered the segment with the caption. For instance, in some cases where HU names are available (e.g., received from an HU smart phone or the like or stored in an AU device or relay database and associated with specific HU phone numbers or other calling addresses), each caption segment presented may be associated with the HU's name. In other cases HU images stored in an AU's captioned device or other system device may be presented along with captions. In still other cases, where available, live videos of HUs may be presented where captions uttered by each HU are spatially associated on the AU device display screen with the HU videos.
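
The segment bookkeeping described here can be sketched as follows, assuming each stored voice segment carries start/end time stamps and an HU identifier, and that each caption arrives tagged with a time stamp that falls inside its source segment; all field names are assumptions for illustration.

    # Illustrative sketch of per-HU caption attribution; field names are assumptions.
    def attribute_caption(segments, caption_text, caption_time):
        """Attach a caption to the stored voice segment whose time span contains it."""
        for segment in segments:
            if segment["start"] <= caption_time <= segment["end"]:
                return {"hu_id": segment["hu_id"],
                        "hu_name": segment.get("hu_name"),
                        "text": caption_text,
                        "time": caption_time}
        # No matching segment (e.g., clock skew); fall back to an unattributed caption.
        return {"hu_id": None, "hu_name": None,
                "text": caption_text, "time": caption_time}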

In particularly advantageous cases captions uttered by different HUs in a sequence will be presented with a temporal aspect to their arrangement. For instance, in some cases captions and HU identifiers will scroll upward so that new captions are added near the bottom of the AU device display screen. In this regard see for instance FIG. 66 which shows one exemplary AU device display screen 2600 where a series of four HU caption segments are presented at 2602, a separate segment uttered by each of four separate HUs on an ongoing call. An image of each HU is presented to the right of a corresponding caption segment uttered by that HU. In the exemplary screenshot, a most recent HU utterance is shown near the bottom at 2604 where that utterance is visually distinguished in larger font and in a "call-out" box 2604 to distinguish it from prior captions and the HU icon 2608 corresponding to the most recent utterance is shown larger as well. As shown, captions are arranged temporally with the oldest at the top and most recent at the bottom. One advantage to the screenshot in FIG. 66 is that an AU simply has to fix her gaze near the bottom of the screenshot at the area generally indicated by box 2604 to see the most recent captions as all new captions are presented at that location.

FIG. 67 shows another exemplary AU device screenshot 2620 where HU voice signal captions are temporally arranged. In screenshot 2620, HU voice captions are again arranged with the oldest near the top of the screen and most recent near the bottom but here captions are arranged by HU in different HU specific columns. Again, the screenshot corresponds to a four HU conference call and therefore four HUs are represented in the screenshot 2620 spaced apart along an upper row at the top of HU columns 2622, 2624, 2626 and 2628. Each captioned voice signal corresponding to a specific user is located within the column below the HU image. Again, a most recent HU utterance at the bottom of the screenshot is visually distinguished (e.g., different font or font size or highlighting, etc.) to call attention thereto and the HU image associated therewith is enlarged to indicate the current caller. In other cases a background area surrounding an HU image may be distinguished (e.g., highlighted or otherwise presented with a distinguishing appearance) to indicate a current speaker or even the volume associated with the voice signal received from the speaker (see FIG. 67).

FIG. 68 is similar to FIG. 67, albeit where most recent text is presented at the top end of the AU display screen and prior presented text scrolls downward as new text is generated. In FIG. 68, the new text is specifically presented at a central location along the top edge of the display screen so that it is located just below the display camera 2641 so that as the AU is viewing the new text, if video of the AU is captured to present to the HUs, the AU appears to be looking directly at the HUs in the video while viewing the most recent text. Here, each HU is associated with a different text column on the screenshot with HU images presented along the lower edge of the display screen in associated columns. Once a most recent text segment is completed, that text slides over and into the column associated with the HU that generated the associated voice signal. Thus, for instance, in FIG. 68, once the text at 2621 ages, that text segment is moved into the right column as indicated by arrow 2631 to be associated with HU image 2629.

In at least some cases a single ASR or CA or a combination of a single ASR and a single CA may operate to generate captions for a plurality of HUs on a conference call. To this end, for instance, a CA may simply receive a constant stream of HU voice signals from two or more HUs and continually caption those signals as if they were generated by a single HU and the system may automatically associate specific caption text with specific HU voice signal segments and associated HUs so that AU device screenshots akin to those in one of FIGS. 66 through 68 may be presented to the AU. Similarly, an ASR engine may simply receive a continuous flow of HU voice signals from many HU voices and transcribe time stamped text captions therefor as if the HU voice signals were from a single HU. Then, a system processor may use the ASR text caption time stamps to associate the ASR captions with the previously stored HU voice signal segments and specific HUs so that screenshots akin to those described above where captions are associated with specific HUs can be presented. Thus, here, the CA simply captions all voice signals and the system automatically associates different text captions with the different HUs on the call without any input from the CA.

Where ASR generated HU specific text is presented as in, for instance, FIG. 67, that ASR text may be presented to an error correcting CA either as a constant flow of captions that are not associated with specific HUs or in an HU associated way as shown in, for instance, FIG. 67. When presented as a constant flow of captions not associated with specific HUs, when a CA corrects an error in the captions, the relay sends error correction text to the AU for in line correction as described above and the AU captioned device then corrects the associated text segment on the AU device display. When the HU captions are presented in HU associated form as in FIG. 67, error corrections would be applied to the appropriate caption segments as well.

In other cases when a multi-HU conference call occurs where HU voice signals can be individually discerned, a separate ASR may be assigned to each distinguishable voice signal. By assigning a specific ASR to a specific HU voice signal, the ASR can train during a call to the specific HU voice so that eventually the ASRs may be able to take over the entire call so that one CA or a reduced CA role can be implemented. Similarly, where separate ASRs are assigned to different HU voice signals on the same call, one or more of the ASRs may take over full or partial captioning duties from a CA for one or more of the HU voice signals while other HU voice signals continue to be captioned and/or error corrected by CAs instead of ASRs. For instance, where first through fourth ASRs operate on first through fourth HU voice signals initially, if the first and second ASRs become accurate enough to take over captioning entirely from a CA, those ASRs may automatically or at CA discretion take over captioning of the first and second associated voice signals while the third and fourth ASRs continue to train on the third and fourth HU voice signals.

In some cases, a separate CA for captions or error corrections may be assigned to each distinguishable voice signal. In still other cases, the number of CAs assigned to a conference call may be dynamic and be a function of any of several factors including the number of HUs linked to the call, speaking rates of HUs, call quality characteristics (e.g., noise on the line), etc. In some cases a single ASR may feed two or more error correcting CAs on a single conference call. For instance, where four HUs are linked to one conference call, first and second CAs may handle error corrections for first and second HUs and third and fourth HUs, respectively. In each case, as error corrections are made, the system automatically sorts out which captions need correcting on the AU device and makes in line corrections accordingly.

Referring to FIG. 69, a teleconference process 2700 for supporting an AU when conferring with more than one HU is illustrated. At step 2702 an AU's captioned device is linked to a multi-HU conference call. At block 2704, an AU captioned device receives HU voice signals from a plurality of HU communication devices where each signal includes an HU identifier or can otherwise be associated with a specific one of the HUs on the call (e.g., via voice recognition based on voice characteristics). At 2706, the AU device assigns start and end time stamps to each one of the voice signal segments. Here, a segment will typically be a speaking turn for an associated HU (e.g., a single persistent speaking duration for one HU prior to another HU speaking). In other cases, however, a voice signal segment may be shorter (e.g., on the order of 2-4 seconds of voice signal). At 2708 the voice signals are transmitted to the relay with HU identifiers and the voice signals are provided to an ASR (e.g., directly, through the relay, at the relay, etc.).

Referring still to FIG. 69, at 2710 time stamped ASR text is received at the AU captioned device. At 2712 the AU captioned device correlates the ASR text with specific HUs based on the time stamps and presents the ASR text to the AU via the captioned device display as uncorrected text.

Referring again to FIG. 69, at block 2714 the relay also receives the time stamped ASR text, which is presented at 2716 to the CA via a workstation display while the associated HU voice signal is broadcast to the CA for error recognition. Error corrections are transmitted with the time stamps to the AU captioned device at 2718. The captioned device uses the time stamps to identify which presented captions need to be corrected and then performs in line corrections at block 2720.
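
A sketch of the time stamp driven in line correction at block 2720 is shown below, assuming the captioned device keeps displayed caption segments keyed by their time stamps; the store layout and the correction message format are assumptions for illustration.

    # Illustrative sketch of time stamp keyed in line correction.
    def apply_inline_correction(displayed_captions, correction):
        """Replace the displayed caption segment matching the correction's time stamp."""
        key = correction["time_stamp"]
        if key in displayed_captions:
            displayed_captions[key]["text"] = correction["corrected_text"]
            displayed_captions[key]["corrected"] = True  # optionally highlight the fix
        return displayed_captions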

In some cases it is contemplated that an ASR may be more accurate for some HU voice signals than others on a conference call. For instance, where four HUs participate in a conference call with one AU, an ASR handling all of the HU voice signals may eventually train to the point where accuracy for the first and second HU voice signals is above a threshold level and for third and fourth HU voice signals is below the threshold. Here, the system may automatically adapt so that CA captions or CA error corrections or both are only allowed for the third and fourth voice signals that have accuracy ratings below the threshold level and are disallowed for the more accurate first and second voice signals.

In this case, a CA workstation may present all of the ASR text captions for all the HUs as in FIG. 67 but only allow changes to the captions associated with the third and fourth HU voice signals to lessen the captioning burden on the CA or to cut out one or more of the CAs assigned to a call. Here, where the ASR caption accuracy exceeds a threshold level for the first and second of four HU voices so that CA caption tasks are not required for those voices, the captioning burden on a CA would be substantially reduced as the sheer volume of CA captioning or error correction would be halved in many cases and periods during which the first and second HUs are speaking could be treated like silent periods where those durations could be completely skipped if a CA were to fall behind in captioning or correcting for the third and fourth voices. In addition, because the captioning/correction burden is reduced, in at least some cases when an ASR takes over for at least a portion of a conference call, the system may automatically switch from a first highly skilled or more seasoned CA to a less skilled or less trained CA that should be able to handle the reduced captioning/correction load sufficiently well.
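
The per-HU gating rule can be sketched as a check of a per-voice accuracy estimate maintained during the call, where only captions from speakers whose ASR accuracy remains below the threshold stay editable; the estimate source and the threshold value are assumptions.

    # Illustrative per-HU gating sketch; accuracy estimates and threshold are assumptions.
    ACCURACY_THRESHOLD = 0.96

    def ca_editable_hus(per_hu_accuracy):
        """Return the set of HU identifiers whose captions the CA may still correct."""
        return {hu for hu, accuracy in per_hu_accuracy.items()
                if accuracy < ACCURACY_THRESHOLD}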

AU-AU Communication Captioning

In at least some cases it is contemplated that first and second AUs may confer using first and second AU captioned devices where each AU requires captioning of the other AU's voice signal. Here, in some cases each AU may have a fully functioning captioned device capable of linking to a relay to provide the other AU's voice signal to the relay and for receiving caption text back to present to an associated AU (e.g., the first captioned device used by the first AU presents text associated with the second AU's voice signals and the second captioned device used by the second AU presents text associated with the first AU's voice signals).

In other cases, however, it is contemplated that the AU captioned devices may be programmed so that one of the captioned devices operates as a primary captioned device that links to the relay and the other operates like a secondary captioned device that only links to the relay through the primary captioned device. In this regard, in at least some cases, when the secondary captioned device captures a second AU's voice signal and transmits that signal to the primary captioned device, the primary captioned device may be programmed to transmit that second AU voice signal to a relay for captioning. In addition, the secondary captioned device transmits a "captioned device" signal to the first captioned device indicating that the second AU's device is in fact another captioned device.

In addition, the primary captioned device is programmed to recognize when a captioned device signal is received from another communication device (e.g., in this case the secondary captioned device) and to thereby recognize when the other communication device is in fact a captioned device. Upon recognizing that the other device is a captioned device, the primary captioned device is programmed to automatically transmit any first AU voice signal captured by the primary captioned device to the relay for captioning. Thus, here, when two AU captioned devices are linked for caption assisted voice communications, a primary captioned device transmits each of the first and second AU voice signals to the relay for captioning. Here, an AU indicator or identifier may be transmitted with each voice signal segment that associates the segment with a specific one of the AUs so that the relay can distinguish the first and second AU voice signals.

It should be appreciated that various aspects of the above systems provide many different advantages including, in at least some cases, increased captioning speed, increased captioning accuracy, reduced burden on captioning CAs, reduced captioning cost and increased AU and HU privacy as well as additional captioning interface features for each of the AU, CA and, in some cases, the HU involved in a captioned call. To this end, some of the described aspects and features that afford these advantages are listed hereafter.

Aspects and features that increase captioning speed include but are not limited to the following:

(1) CA error corrections presented more quickly to the AU because of the ASR.
(2) Processor speeding through high confidence factor text at an expedited (e.g., double) rate.
(3) Processor providing options to the CA for low confidence factor text.
(4) Limit the CA error correction window to near real time ASR text; there is no reason to error correct text the AU will never see.
(5) Switch out a CA when too far behind and bring in a second CA to pick up the slack.
(6) Switch out a more skilled CA for a CA that is delayed for some reason.
(7) CA applies experience to decide which caption type to use for best results.
(8) AU or HU device captioning to increase speed of initial ASR text.
(9) Better CA interface with stationary fields.
(10) Automatic acceptance of ASR text when not acted upon.
(11) Scored CAs for specific calls based on demands.
(12) Switching between CAs when a first CA is struggling to meet captioning speed and accuracy requirements.
(13) Applying two CAs to a single HU voice signal to expedite captioning at least at times.

Aspects and features of the present disclosure that increase accuracy include but are not limited to the following:

(1) Generate ASR captions first with CA error correction. Here, the CA is more accurate because there is less overall burden associated with transcribing and correcting captions.
(2) The ASR trains on CA error corrections.
(3) Commence captioning using a remote ASR and a local ASR in parallel. Here, the remote ASR may be most accurate initially but in some cases untrainable using CA error corrections; however, a local ASR is more likely to be trainable and hence should be more accurate as it trains.
(4) Running multiple ASRs in parallel and selecting the most accurate automatically.
(5) Provide guidance to the CA for switching between captioning processes to lead to a more accurate process.
(6) Run metrics and tests to encourage CAs to strive for accuracy.
(7) Select ASRs based on the HU, on voice type or characteristics, or on call characteristics (e.g., line noise level, high or low definition audio, etc.).
(8) CA applies experience to decide which captioning process to use for best results.
(9) AU communication device captioning at least some HU voice signal segments to increase accuracy.
(10) HU communication device captioning at least some HU voice signal segments to increase accuracy.
(11) Scored CAs for specific calls based on demands.
(12) Having one CA generate text and a second CA error correct that text.

Aspects and features of the present disclosure that reduce the captioning and error correcting burden on a CA include but are not limited to the following:

(1) ASR generating initial text that is provided to the CA as well as the AU.
(2) ASR indicating low confidence factor (CF) ASR text.
(3) ASR presenting options for low CF text that are selectable by a CA to error correct.
(4) A first CA generating HU voice signal captions and a second CA error correcting those captions.
(5) Processor only presenting low CF text and HU voice to the CA at times (e.g., when error correction is behind) or persistently (always take high CF text out of consideration).
(6) Detect CA stress level and build in recuperation time when needed as opposed to when scheduled.
(7) Interface where the CA needs to do nothing to accept ASR text if it is accurate.

Aspects and features of the present disclosure that reduce overall cost include but are not limited to the following:

(1) Eliminate all CAs from a call when ASR captioning accuracy exceeds an acceptable threshold level.
(2) When local ASR accuracy persistently exceeds remote ASR accuracy, delink the call from the remote ASR and only use the local ASR to generate captioned text for AU consumption or CA error correction.
(3) Once ASR accuracy exceeds a threshold level, eliminate a captioning CA and only retain an error correcting CA on a call.
(4) Only facilitate CA error correction when an AU looks at captions (e.g., as detected by a camera or other AU device sensor).
(5) CA applies experience to decide which caption type to use for best results.
(6) CA eye tracking interface reducing cost by minimizing CA strain.
(7) Having one CA handle at least portions of two simultaneous ongoing calls.
(8) Have one CA handle captioning and/or error correction for two or more HUs that speak on a single conference call.

Aspects and features of the present disclosure that increase privacy include but are not limited to the following:

(1) Switch out CAs periodically so each CA only perceives a portion of a call.
(2) CA only presented low CF text for error correction so the CA can only perceive part of a conversation.
(3) CAs only error correct when the AU is looking at captions (as sensed by a camera or other AU device sensor).
(4) AU can select complete privacy. Here, in another case, when complete privacy is needed, the CA may simply correct low CF text. In the alternative, the system may indicate low CF text to the AU and allow the AU to request CA error correction of any low CF text.

Additional aspects and features of the present disclosure that add value for an AU include but are not limited to the following:

(1) Privacy option (FIG. 26).
(2) Better understanding of the caption process.
(3) CA caption option selection.
(4) Understanding of low CF words and phrases.
(5) Ability to catch up when desired.
(6) Understanding of where the CA is in error correction.
(7) Faster initial text.
(8) Faster error correction.
(9) Ability to split the screen and see prior captions and ongoing real time captions.
(10) Understand caption delay.
(11) Option to adjust between speed and accuracy.
(12) Other information indicating emotions.
(13) Understanding of the current accuracy level.
(14) AU and HU captions with AU captions generated by an ASR to increase contextual understanding of the complete conversation.
(15) Understand line quality and other call characteristics (FIG. 24).
(16) Dual HU view for full communication (FIG. 65).

Additional aspects and features of the present disclosure that add additional value for a CA include but are not limited to the following:

(1) Option to switch between complete ASR, CA captioning and error correction, and ASR captioning with CA error correction.
(2) Ability to understand turn taking between AU and HU (FIG. 23).
(3) Ability to adjust audio or ASR text first and alignment generally (FIG. 25).
(4) Ability to track captioned text currently broadcast (FIG. 39).
(5) Ability to see low CF text (FIG. 40).
(6) Ability to track real time metrics (FIG. 40).
(7) Ability to rapidly progress through expedited HU voice for high CF text (FIG. 40).
(8) Stationary line and low CF fields (FIGS. 44, 44A).
(9) Coaching of a CA to change the caption method (FIGS. 47, 50).

Additional aspects and features of the present disclosure that add additional value for an HU include but are not limited to the following:

(1) Coaching on speed, enunciation, etc. (FIG. 27).
(2) Understand AU progress (word broadcast, where error corrections are at, which words have been presented as text to the AU, etc.) (FIG. 27).
(3) Ability to initiate a caption process change based on caption accuracy feedback.

Additional aspects and features of the present disclosure that add additional value for a captioning system administrator include but are not limited to the following:

(1) Ability to enhance CA caption and error correction training.
(2) Metrics to track CA activities, speed, accuracy.
(3) Scoring system to rate CAs.

In at least some cases where ASR text is presented to an AU and an HU voice signal is delayed at least somewhat so that ASR text and HU voice can be presented more synchronously or precisely synchronously to an AU, the amount of voice delay may be adaptive and automatically changed by the system based on a number of factors. Similarly, in cases where ASR and HU voice are delayed so that at least some ASR error correction can occur prior to presentation to an AU, the amount of voice and ASR caption delay may be adaptive and automatically changed by the system based on several factors. For instance, HU voice broadcast and ASR captions may be dynamically adapted based on the level of ASR error correction that occurs prior to a current time during an ongoing call. For example, in cases where a call is progressing and no ASR error corrections occur during an initial 2 minute period, the HU voice and ASR caption delay may be minimized so that the captions and HU voice are presented relatively quickly (e.g., either immediately upon occurrence or, in some cases, where the HU voice signal is slightly delayed so that it is aligned in time with the ASR captions). In other cases where an ASR makes substantial corrections in initial captions, delays may be increased so that at least some of the ASR corrections occur prior to caption and related HU voice presentation to the AU. Here it is contemplated that the adaptive delay would change during an ongoing call based on the degree of error corrections required.

The delay may be based on the level of error correction to initial ASR captions during the entire prior duration of an ongoing call, during a most recent rolling period of an ongoing call or during any other period. As another example, in a case where ASR error corrections occur within X seconds (e.g., 5 seconds) of generation of initial ASR text, the delay may be based on the degree of error correction during a duration (e.g., one minute) that ends X seconds prior to a current time.
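
A sketch of one possible adaptive delay update is shown below, assuming corrections are counted over a rolling window that ends X seconds before the present as just described; the window sizes, bounds, and the linear ramp from correction rate to broadcast delay are assumptions, not prescribed values.

    # Illustrative adaptive-delay sketch; constants and the ramp are assumptions.
    WINDOW_SECONDS = 60.0   # rolling window over which corrections are counted
    SETTLE_SECONDS = 5.0    # "X": corrections newer than this may still arrive
    MIN_DELAY = 0.5         # seconds of HU voice delay when captions look clean
    MAX_DELAY = 6.0         # upper bound on HU voice delay

    def adaptive_delay(correction_times, word_times, now):
        """Pick an HU voice broadcast delay from the recent error correction rate."""
        window_end = now - SETTLE_SECONDS
        window_start = window_end - WINDOW_SECONDS
        corrections = sum(1 for t in correction_times if window_start <= t <= window_end)
        words = sum(1 for t in word_times if window_start <= t <= window_end)
        if words == 0:
            return MIN_DELAY
        rate = corrections / words  # fraction of recent ASR words that needed fixing
        # Ramp toward MAX_DELAY as the correction rate grows; saturates around 10%.
        return min(MAX_DELAY, MIN_DELAY + rate * (MAX_DELAY - MIN_DELAY) * 10.0)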

In other cases adaptive delay may be based on other factors like confidence factors associated with initial ASR generated text, content in HU and AU voice messages or other parameters. Parameters used to assess and adapt voice broadcast and caption presentation delays will be referred to hereinafter as caption quality factors.

While embodiments are described above where specific CAs are associated with preferred and non-preferred lists or optimal and non-optimal lists for specific AUs, it should be appreciated that similar preferences or optimality ratings may be ascribed to different captioning processes. For instance, a first AU may routinely rank ASR captioning poorly but full CA captioning highly and, in that case, the system may automatically configure so that all calls for the first AU are handled via full CA captioning. For a second AU, the system may automatically generate caption confidence factors and use those factors to determine that the mix of captioning speed and accuracy is almost always best when initial captions are generated via an ASR system and one of 25 CAs that are optimal for the second AU is assigned to perform error corrections on the initial caption text.

To apprise the public of the scope of the present invention the following claims are made.

What is claimed is:
1. A method for captioning a hearing user's (HU's) voice during a call with an assisted user (AU), the method comprising the steps of: (a) storing a plurality of HU voice profiles and associated voice models for each of a plurality of HU device identifiers in a voice recognition database; (b) subsequent to receiving an incoming call at an AU communication device; (c) identifying an HU device identifier associated with the HU device used to initiate the incoming call; (d) receiving HU voice signal during the call; (e) comparing the HU voice signal to HU voice profiles associated with the HU device identifier to identify a current HU voice profile associated with the HU voice signal; (f) selecting the voice model that is associated with the current HU voice profile as a current voice model; (g) using the current voice model to transcribe the HU voice signal to text; (h) presenting the text on a display screen of the AU communication device; and (i) repeating steps (d) through (h) to continually identify a current HU voice model and use the current voice model to transcribe.
2. The method of claim 1 further including the steps of, determining that the HU voice signal does not match any of the stored HU voice profiles, using a default voice model to generate text and training the default voice model to generate a new voice model.
3. The method of claim 2 further including using the HU voice signal to generate a new voice profile and storing the new voice profile and the new voice model in the voice recognition database for subsequent use.
4. The method of claim 3 wherein the new voice model and new voice profile are stored along with the HU device identifier.
5. The method of claim 2 further including, upon determining that the HU voice signal does not match any of the stored HU voice profiles, having a call assistant (CA) transcribe the HU voice signal to text which is presented via the display screen while the new voice model is trained.
6. The method of claim 5 further including monitoring accuracy of the new voice model during training and, once accuracy exceeds a threshold level, switching from the CA generated text to use the new voice model to generate the text that is presented via the display screen.
7. The method of claim 1 wherein the AU communication device links to a remote relay for captioning services and wherein the HU voice profiles and voice models are stored at the relay.
8. The method of claim 1 wherein the HU voice profiles and voice models are stored in the AU communication device.
9. The method of claim 1 wherein each HU voice model is periodically modified as additional HU voice signal is processed to generate text.
10. The method of claim 9 wherein a call assistant (CA) corrects errors in the text and the system automatically modifies an HU voice model based on CA error corrections.
11. The method of claim 2 wherein the step of using a default voice model includes identifying HU voice signal characteristics and selecting one of a plurality of default voice models based on the identified HU voice signal characteristics.
12. The method of claim 1 wherein the HU communication device identifier is a phone number.
13. The method of claim 1 wherein the HU communication device identifier is a network address.
14. A method for captioning a hearing user's (HU's) voice during a call with an assisted user (AU), the method comprising the steps of: during a voice call between an HU communication device and an AU communication device, receiving an HU voice signal; using an automated speech recognition (ASR) engine to generate caption text for the HU voice signal; storing the caption text in a memory device without initially presenting the text captions; receiving a caption activation signal from the assisted user at a first time; and presenting the caption text corresponding to a period prior to the first time to the AU via an AU communication device display screen.
15. The method of claim 14 wherein the AU communication device includes a user interface that includes a caption activation feature that the AU may use to generate the caption activation signal.
16. The method of claim 14 wherein the period prior to the first time includes a duration of 20 seconds or less.
17. The method of claim 14 further including broadcasting the HU voice signal in essentially real time to the AU via a speaker.
18. The method of claim 17 further including, upon receiving the caption activation signal, generating caption text for the HU voice signal as the HU voice signal is received and presenting the continuing caption text via the device display as that text is generated.
19. A method for captioning a hearing user's (HU's) voice during a call with an assisted user (AU), the method comprising the steps of: during a voice call between an HU communication device and an AU communication device, receiving an HU voice signal; broadcasting the HU voice signal via a speaker to the AU; receiving a caption activation signal from the assisted user at a first time; in response to receiving the caption activation signal: using an automated speech recognition (ASR) engine to generate ASR text for the HU voice signal; forming a link to a call assistant (CA) at a remote relay; transmitting the HU voice signal to the CA for transcription to CA generated text; receiving the CA generated text at the AU communication device; prior to receiving the CA generated text at the AU communication device, presenting the ASR text via an AU communication device display screen; and subsequent to receiving the CA generated text at the AU communication device, presenting the CA generated text via the AU communication device display screen.